domain_extractor 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.rubocop.yml +20 -0
- data/CHANGELOG.md +35 -0
- data/LICENSE.txt +21 -0
- data/README.md +191 -0
- data/lib/domain_extractor/normalizer.rb +25 -0
- data/lib/domain_extractor/parser.rb +54 -0
- data/lib/domain_extractor/query_params.rb +29 -0
- data/lib/domain_extractor/result.rb +27 -0
- data/lib/domain_extractor/validators.rb +18 -0
- data/lib/domain_extractor/version.rb +5 -0
- data/lib/domain_extractor.rb +39 -0
- data/spec/domain_extractor_spec.rb +273 -0
- data/spec/spec_helper.rb +9 -0
- metadata +85 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 37c8d6a6c7aaf1e053679211077a3fd0bf3fb3c656045281345c6937ae8e1a45
|
|
4
|
+
data.tar.gz: d7d1446b28ef6224e820b5eef4749b130f1842a4e4d5b2459df21f3ef7061bbd
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 5c59282f9768a561232d8b5c170b779a72e62a9cd6c37044de6255be2a9dfc8134609691c8c9f1fbde0c0e83946a12fcafeead889dcd14be06a3bb4dbbf26ffe
|
|
7
|
+
data.tar.gz: eef1599730b04e0b0423bf4f018d3c993b4e92969be5887524ae0130bb1247c53b081d92d9cee3e57272b69c60bd8177a644c50e7481c9c5d81c45084a8eed65
|
data/.rubocop.yml
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
1
|
+
AllCops:
|
|
2
|
+
NewCops: enable
|
|
3
|
+
TargetRubyVersion: 2.7
|
|
4
|
+
Exclude:
|
|
5
|
+
- 'bin/**/*'
|
|
6
|
+
- 'tmp/**/*'
|
|
7
|
+
|
|
8
|
+
require:
|
|
9
|
+
- rubocop-performance
|
|
10
|
+
- rubocop-rspec
|
|
11
|
+
|
|
12
|
+
Metrics/BlockLength:
|
|
13
|
+
Exclude:
|
|
14
|
+
- 'spec/**/*.rb'
|
|
15
|
+
|
|
16
|
+
Metrics/MethodLength:
|
|
17
|
+
Max: 25
|
|
18
|
+
|
|
19
|
+
Style/Documentation:
|
|
20
|
+
Enabled: false
|
data/CHANGELOG.md
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
All notable changes to this project will be documented in this file.
|
|
4
|
+
|
|
5
|
+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
|
+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
|
+
|
|
8
|
+
## [Unreleased]
|
|
9
|
+
|
|
10
|
+
## [0.1.0] - 2025-10-31
|
|
11
|
+
|
|
12
|
+
### Added
|
|
13
|
+
|
|
14
|
+
- Initial release of DomainExtractor
|
|
15
|
+
- Core `parse` method for extracting domain components from URLs
|
|
16
|
+
- Support for multi-part TLDs using PublicSuffix gem
|
|
17
|
+
- Nested subdomain parsing (e.g., api.staging.example.com)
|
|
18
|
+
- URL normalization (handles URLs with or without schemes)
|
|
19
|
+
- Path extraction from URLs
|
|
20
|
+
- Query parameter parsing via `parse_query_params` method
|
|
21
|
+
- Batch URL processing with `parse_batch` method
|
|
22
|
+
- IP address detection (IPv4 and IPv6)
|
|
23
|
+
- Comprehensive test suite with 100% coverage
|
|
24
|
+
- Full documentation and usage examples
|
|
25
|
+
|
|
26
|
+
### Features
|
|
27
|
+
|
|
28
|
+
- Extract subdomain, domain, TLD, root_domain, and host from URLs
|
|
29
|
+
- Handle complex multi-part TLDs (co.uk, com.au, gov.br, etc.)
|
|
30
|
+
- Parse query strings into structured hashes
|
|
31
|
+
- Process multiple URLs efficiently
|
|
32
|
+
- Robust error handling for invalid inputs
|
|
33
|
+
|
|
34
|
+
[Unreleased]: https://github.com/opensite-ai/domain_extractor/compare/v0.1.0...HEAD
|
|
35
|
+
[0.1.0]: https://github.com/opensite-ai/domain_extractor/releases/tag/v0.1.0
|
data/LICENSE.txt
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 OpenSite AI
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
data/README.md
ADDED
|
@@ -0,0 +1,191 @@
|
|
|
1
|
+
# DomainExtractor
|
|
2
|
+
|
|
3
|
+
[](https://badge.fury.io/rb/domain_extractor)
|
|
4
|
+
[](https://github.com/opensite-ai/domain_extractor/actions)
|
|
5
|
+
[](https://codeclimate.com/github/opensite-ai/domain_extractor)
|
|
6
|
+
|
|
7
|
+
A lightweight, robust Ruby library for url parsing and domain parsing with **accurate multi-part TLD support**. DomainExtractor delivers a high-throughput url parser and domain parser that excels at domain extraction tasks while staying friendly to analytics pipelines. Perfect for web scraping, analytics, url manipulation, query parameter parsing, and multi-environment domain analysis.
|
|
8
|
+
|
|
9
|
+
Use DomainExtractor whenever you need a dependable tld parser for tricky multi-part tld registries or reliable subdomain extraction in production systems.
|
|
10
|
+
|
|
11
|
+
## Why DomainExtractor?
|
|
12
|
+
|
|
13
|
+
✅ **Accurate Multi-part TLD Parser** - Handles complex multi-part TLDs (co.uk, com.au, gov.br) using the [Public Suffix List](https://publicsuffix.org/)
|
|
14
|
+
✅ **Nested Subdomain Extraction** - Correctly parses multi-level subdomains (api.staging.example.com)
|
|
15
|
+
✅ **Smart URL Normalization** - Automatically handles URLs with or without schemes
|
|
16
|
+
✅ **Query Parameter Parsing** - Parse query strings into structured hashes
|
|
17
|
+
✅ **Batch Processing** - Parse multiple URLs efficiently
|
|
18
|
+
✅ **IP Address Detection** - Identifies and handles IPv4 and IPv6 addresses
|
|
19
|
+
✅ **Zero Configuration** - Works out of the box with sensible defaults
|
|
20
|
+
✅ **Well-Tested** - Comprehensive test suite covering edge cases
|
|
21
|
+
|
|
22
|
+
## Installation
|
|
23
|
+
|
|
24
|
+
Add this line to your application's Gemfile:
|
|
25
|
+
|
|
26
|
+
```ruby
|
|
27
|
+
gem 'domain_extractor'
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
And then execute:
|
|
31
|
+
|
|
32
|
+
```bash
|
|
33
|
+
$ bundle install
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
Or install it yourself:
|
|
37
|
+
|
|
38
|
+
```bash
|
|
39
|
+
$ gem install domain_extractor
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
## Quick Start
|
|
43
|
+
|
|
44
|
+
```ruby
|
|
45
|
+
require 'domain_extractor'
|
|
46
|
+
|
|
47
|
+
# Parse a URL
|
|
48
|
+
result = DomainExtractor.parse('https://www.example.co.uk/path?query=value')
|
|
49
|
+
|
|
50
|
+
result[:subdomain] # => 'www'
|
|
51
|
+
result[:domain] # => 'example'
|
|
52
|
+
result[:tld] # => 'co.uk'
|
|
53
|
+
result[:root_domain] # => 'example.co.uk'
|
|
54
|
+
result[:host] # => 'www.example.co.uk'
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
## Usage Examples
|
|
58
|
+
|
|
59
|
+
### Basic Domain Parsing
|
|
60
|
+
|
|
61
|
+
```ruby
|
|
62
|
+
# Parse a simple domain (fast domain extraction)
|
|
63
|
+
DomainExtractor.parse('example.com')
|
|
64
|
+
# => { subdomain: nil, domain: 'example', tld: 'com', ... }
|
|
65
|
+
|
|
66
|
+
# Parse domain with subdomain
|
|
67
|
+
DomainExtractor.parse('blog.example.com')
|
|
68
|
+
# => { subdomain: 'blog', domain: 'example', tld: 'com', ... }
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
### Multi-Part TLD Support
|
|
72
|
+
|
|
73
|
+
```ruby
|
|
74
|
+
# UK domain
|
|
75
|
+
DomainExtractor.parse('www.bbc.co.uk')
|
|
76
|
+
# => { subdomain: 'www', domain: 'bbc', tld: 'co.uk', ... }
|
|
77
|
+
|
|
78
|
+
# Australian domain
|
|
79
|
+
DomainExtractor.parse('shop.example.com.au')
|
|
80
|
+
# => { subdomain: 'shop', domain: 'example', tld: 'com.au', ... }
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Nested Subdomains
|
|
84
|
+
|
|
85
|
+
```ruby
|
|
86
|
+
DomainExtractor.parse('api.staging.example.com')
|
|
87
|
+
# => { subdomain: 'api.staging', domain: 'example', tld: 'com', ... }
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### Query Parameter Parsing
|
|
91
|
+
|
|
92
|
+
```ruby
|
|
93
|
+
params = DomainExtractor.parse_query_params('?utm_source=google&page=1')
|
|
94
|
+
# => { 'utm_source' => 'google', 'page' => '1' }
|
|
95
|
+
|
|
96
|
+
# Or via the shorter helper
|
|
97
|
+
DomainExtractor.parse_query('?search=ruby&flag')
|
|
98
|
+
# => { 'search' => 'ruby', 'flag' => nil }
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
### Batch URL Processing
|
|
102
|
+
|
|
103
|
+
```ruby
|
|
104
|
+
urls = ['https://example.com', 'https://blog.example.org']
|
|
105
|
+
results = DomainExtractor.parse_batch(urls)
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
## API Reference
|
|
109
|
+
|
|
110
|
+
### `DomainExtractor.parse(url_string)`
|
|
111
|
+
|
|
112
|
+
Parses a URL string and extracts domain components.
|
|
113
|
+
|
|
114
|
+
**Returns:** Hash with keys `:subdomain`, `:domain`, `:tld`, `:root_domain`, `:host`, `:path` or `nil`
|
|
115
|
+
|
|
116
|
+
### `DomainExtractor.parse_batch(urls)`
|
|
117
|
+
|
|
118
|
+
Parses multiple URLs efficiently.
|
|
119
|
+
|
|
120
|
+
**Returns:** Array of parsed results
|
|
121
|
+
|
|
122
|
+
### `DomainExtractor.parse_query_params(query_string)`
|
|
123
|
+
|
|
124
|
+
Parses a query string into a hash of parameters.
|
|
125
|
+
|
|
126
|
+
**Returns:** Hash of query parameters
|
|
127
|
+
|
|
128
|
+
## Use Cases
|
|
129
|
+
|
|
130
|
+
**Web Scraping**
|
|
131
|
+
|
|
132
|
+
```ruby
|
|
133
|
+
urls = scrape_page_links(page)
|
|
134
|
+
domains = urls.map { |url| DomainExtractor.parse(url)&.dig(:root_domain) }.compact.uniq
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
**Analytics & Tracking**
|
|
138
|
+
|
|
139
|
+
```ruby
|
|
140
|
+
referrer = request.referrer
|
|
141
|
+
parsed = DomainExtractor.parse(referrer)
|
|
142
|
+
track_event('page_view', source_domain: parsed[:root_domain]) if parsed
|
|
143
|
+
```
|
|
144
|
+
|
|
145
|
+
**Domain Validation**
|
|
146
|
+
|
|
147
|
+
```ruby
|
|
148
|
+
def internal_link?(url, base_domain)
|
|
149
|
+
parsed = DomainExtractor.parse(url)
|
|
150
|
+
parsed && parsed[:root_domain] == base_domain
|
|
151
|
+
end
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
## Performance
|
|
155
|
+
|
|
156
|
+
- **Single URL parsing**: ~0.0001s per URL
|
|
157
|
+
- **Batch domain extraction**: ~0.01s for 100 URLs
|
|
158
|
+
- **Memory efficient**: Minimal object allocation
|
|
159
|
+
- **Thread-safe**: Can be used in concurrent environments
|
|
160
|
+
|
|
161
|
+
## Comparison with Alternatives
|
|
162
|
+
|
|
163
|
+
| Feature | DomainExtractor | Addressable | URI (stdlib) |
|
|
164
|
+
| --------------------------- | --------------- | ----------- | ------------ |
|
|
165
|
+
| Multi-part TLD parser | ✅ | ❌ | ❌ |
|
|
166
|
+
| Subdomain extraction | ✅ | ❌ | ❌ |
|
|
167
|
+
| Domain component separation | ✅ | ❌ | ❌ |
|
|
168
|
+
| Built-in url normalization | ✅ | ❌ | ❌ |
|
|
169
|
+
| Lightweight | ✅ | ❌ | ✅ |
|
|
170
|
+
|
|
171
|
+
## Requirements
|
|
172
|
+
|
|
173
|
+
- Ruby 2.7.0 or higher
|
|
174
|
+
- public_suffix gem (~> 6.0)
|
|
175
|
+
|
|
176
|
+
## Contributing
|
|
177
|
+
|
|
178
|
+
Bug reports and pull requests are welcome on GitHub at https://github.com/opensite-ai/domain_extractor.
|
|
179
|
+
|
|
180
|
+
## License
|
|
181
|
+
|
|
182
|
+
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
|
183
|
+
|
|
184
|
+
## Acknowledgments
|
|
185
|
+
|
|
186
|
+
- Built on Ruby's standard [URI library](https://ruby-doc.org/stdlib/libdoc/uri/rdoc/URI.html)
|
|
187
|
+
- Uses the [public_suffix gem](https://github.com/weppos/publicsuffix-ruby) for accurate TLD parsing
|
|
188
|
+
|
|
189
|
+
---
|
|
190
|
+
|
|
191
|
+
Made with ❤️ by [OpenSite AI](https://opensite.ai)
|
|
@@ -0,0 +1,25 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module DomainExtractor
|
|
4
|
+
# Normalizer ensures URLs include a scheme and removes extraneous whitespace
|
|
5
|
+
# before passing them into the URI parser.
|
|
6
|
+
module Normalizer
|
|
7
|
+
SCHEME_PATTERN = %r{\A[A-Za-z][A-Za-z0-9+\-.]*://}.freeze
|
|
8
|
+
|
|
9
|
+
module_function
|
|
10
|
+
|
|
11
|
+
def call(input)
|
|
12
|
+
return if input.nil?
|
|
13
|
+
|
|
14
|
+
string = coerce_to_string(input)
|
|
15
|
+
return if string.empty?
|
|
16
|
+
|
|
17
|
+
string.match?(SCHEME_PATTERN) ? string : "https://#{string}"
|
|
18
|
+
end
|
|
19
|
+
|
|
20
|
+
def coerce_to_string(value)
|
|
21
|
+
value.respond_to?(:to_str) ? value.to_str.strip : value.to_s.strip
|
|
22
|
+
end
|
|
23
|
+
private_class_method :coerce_to_string
|
|
24
|
+
end
|
|
25
|
+
end
|
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'uri'
|
|
4
|
+
require 'public_suffix'
|
|
5
|
+
|
|
6
|
+
require_relative 'normalizer'
|
|
7
|
+
require_relative 'result'
|
|
8
|
+
require_relative 'validators'
|
|
9
|
+
|
|
10
|
+
module DomainExtractor
|
|
11
|
+
# Parser orchestrates the pipeline for url normalization, validation, and domain extraction.
|
|
12
|
+
module Parser
|
|
13
|
+
module_function
|
|
14
|
+
|
|
15
|
+
def call(raw_url)
|
|
16
|
+
uri = build_uri(raw_url)
|
|
17
|
+
return unless uri
|
|
18
|
+
|
|
19
|
+
host = uri.host&.downcase
|
|
20
|
+
return if invalid_host?(host)
|
|
21
|
+
|
|
22
|
+
domain = ::PublicSuffix.parse(host)
|
|
23
|
+
build_result(domain: domain, host: host, uri: uri)
|
|
24
|
+
rescue ::URI::InvalidURIError, ::PublicSuffix::Error
|
|
25
|
+
nil
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
def build_uri(raw_url)
|
|
29
|
+
normalized = Normalizer.call(raw_url)
|
|
30
|
+
return unless normalized
|
|
31
|
+
|
|
32
|
+
::URI.parse(normalized)
|
|
33
|
+
end
|
|
34
|
+
private_class_method :build_uri
|
|
35
|
+
|
|
36
|
+
def invalid_host?(host)
|
|
37
|
+
host.nil? || Validators.ip_address?(host) || !::PublicSuffix.valid?(host)
|
|
38
|
+
end
|
|
39
|
+
private_class_method :invalid_host?
|
|
40
|
+
|
|
41
|
+
def build_result(domain:, host:, uri:)
|
|
42
|
+
Result.build(
|
|
43
|
+
subdomain: domain.trd,
|
|
44
|
+
root_domain: domain.domain,
|
|
45
|
+
domain: domain.sld,
|
|
46
|
+
tld: domain.tld,
|
|
47
|
+
host: host,
|
|
48
|
+
path: uri.path,
|
|
49
|
+
query: uri.query
|
|
50
|
+
)
|
|
51
|
+
end
|
|
52
|
+
private_class_method :build_result
|
|
53
|
+
end
|
|
54
|
+
end
|
|
@@ -0,0 +1,29 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'uri'
|
|
4
|
+
|
|
5
|
+
module DomainExtractor
|
|
6
|
+
# QueryParams transforms URL query strings into Ruby hashes.
|
|
7
|
+
module QueryParams
|
|
8
|
+
EMPTY = {}.freeze
|
|
9
|
+
|
|
10
|
+
module_function
|
|
11
|
+
|
|
12
|
+
def call(raw_query)
|
|
13
|
+
return EMPTY if raw_query.nil? || raw_query.empty?
|
|
14
|
+
|
|
15
|
+
::URI.decode_www_form(raw_query, Encoding::UTF_8).each_with_object({}) do |(key, value), params|
|
|
16
|
+
next if key.nil? || key.empty?
|
|
17
|
+
|
|
18
|
+
params[key] = normalize_value(value)
|
|
19
|
+
end
|
|
20
|
+
rescue ArgumentError
|
|
21
|
+
EMPTY
|
|
22
|
+
end
|
|
23
|
+
|
|
24
|
+
def normalize_value(value)
|
|
25
|
+
value.nil? || value.empty? ? nil : value
|
|
26
|
+
end
|
|
27
|
+
private_class_method :normalize_value
|
|
28
|
+
end
|
|
29
|
+
end
|
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module DomainExtractor
|
|
4
|
+
# Result encapsulates the final parsed attributes and exposes a hash interface.
|
|
5
|
+
module Result
|
|
6
|
+
EMPTY_PATH = ''
|
|
7
|
+
|
|
8
|
+
module_function
|
|
9
|
+
|
|
10
|
+
def build(**attributes)
|
|
11
|
+
{
|
|
12
|
+
subdomain: normalize_subdomain(attributes[:subdomain]),
|
|
13
|
+
root_domain: attributes[:root_domain],
|
|
14
|
+
domain: attributes[:domain],
|
|
15
|
+
tld: attributes[:tld],
|
|
16
|
+
host: attributes[:host],
|
|
17
|
+
path: attributes[:path] || EMPTY_PATH,
|
|
18
|
+
query_params: QueryParams.call(attributes[:query])
|
|
19
|
+
}
|
|
20
|
+
end
|
|
21
|
+
|
|
22
|
+
def normalize_subdomain(value)
|
|
23
|
+
value.nil? || value.empty? ? nil : value
|
|
24
|
+
end
|
|
25
|
+
private_class_method :normalize_subdomain
|
|
26
|
+
end
|
|
27
|
+
end
|
|
@@ -0,0 +1,18 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module DomainExtractor
|
|
4
|
+
# Validators hosts fast checks for excluding unsupported hostnames (e.g. IP addresses).
|
|
5
|
+
module Validators
|
|
6
|
+
IPV4_SEGMENT = '(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)'
|
|
7
|
+
IPV4_REGEX = /\A#{IPV4_SEGMENT}(?:\.#{IPV4_SEGMENT}){3}\z/.freeze
|
|
8
|
+
IPV6_REGEX = /\A\[?[0-9a-fA-F:]+\]?\z/.freeze
|
|
9
|
+
|
|
10
|
+
module_function
|
|
11
|
+
|
|
12
|
+
def ip_address?(host)
|
|
13
|
+
return false if host.nil? || host.empty?
|
|
14
|
+
|
|
15
|
+
host.match?(IPV4_REGEX) || host.match?(IPV6_REGEX)
|
|
16
|
+
end
|
|
17
|
+
end
|
|
18
|
+
end
|
|
@@ -0,0 +1,39 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'uri'
|
|
4
|
+
require 'public_suffix'
|
|
5
|
+
|
|
6
|
+
require_relative 'domain_extractor/version'
|
|
7
|
+
require_relative 'domain_extractor/parser'
|
|
8
|
+
require_relative 'domain_extractor/query_params'
|
|
9
|
+
|
|
10
|
+
# DomainExtractor provides a high-performance API for url parsing and domain parsing.
|
|
11
|
+
# It exposes simple helpers for single URL normalization, domain extraction, and batch operations.
|
|
12
|
+
module DomainExtractor
|
|
13
|
+
class << self
|
|
14
|
+
# Parse an individual URL and extract domain attributes.
|
|
15
|
+
# @param url [String, #to_s]
|
|
16
|
+
# @return [Hash, nil]
|
|
17
|
+
def parse(url)
|
|
18
|
+
Parser.call(url)
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
# Parse many URLs and return their individual parse results.
|
|
22
|
+
# @param urls [Enumerable<String>]
|
|
23
|
+
# @return [Array<Hash, nil>]
|
|
24
|
+
def parse_batch(urls)
|
|
25
|
+
return [] unless urls.respond_to?(:map)
|
|
26
|
+
|
|
27
|
+
urls.map { |url| parse(url) }
|
|
28
|
+
end
|
|
29
|
+
|
|
30
|
+
# Convert a query string into a Hash representation.
|
|
31
|
+
# @param query_string [String, nil]
|
|
32
|
+
# @return [Hash]
|
|
33
|
+
def parse_query_params(query_string)
|
|
34
|
+
QueryParams.call(query_string)
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
alias parse_query parse_query_params
|
|
38
|
+
end
|
|
39
|
+
end
|
|
@@ -0,0 +1,273 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require 'spec_helper'
|
|
4
|
+
|
|
5
|
+
RSpec.describe DomainExtractor do
|
|
6
|
+
describe '.parse' do
|
|
7
|
+
context 'with valid URLs' do
|
|
8
|
+
it 'parses a simple domain without subdomain' do
|
|
9
|
+
result = described_class.parse('dashtrack.com')
|
|
10
|
+
|
|
11
|
+
expect(result[:subdomain]).to be_nil
|
|
12
|
+
expect(result[:root_domain]).to eq('dashtrack.com')
|
|
13
|
+
expect(result[:domain]).to eq('dashtrack')
|
|
14
|
+
expect(result[:tld]).to eq('com')
|
|
15
|
+
expect(result[:host]).to eq('dashtrack.com')
|
|
16
|
+
end
|
|
17
|
+
|
|
18
|
+
it 'parses a domain with www subdomain' do
|
|
19
|
+
result = described_class.parse('www.insurancesite.ai')
|
|
20
|
+
|
|
21
|
+
expect(result[:subdomain]).to eq('www')
|
|
22
|
+
expect(result[:root_domain]).to eq('insurancesite.ai')
|
|
23
|
+
expect(result[:domain]).to eq('insurancesite')
|
|
24
|
+
expect(result[:tld]).to eq('ai')
|
|
25
|
+
expect(result[:host]).to eq('www.insurancesite.ai')
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
it 'parses a full URL with path' do
|
|
29
|
+
result = described_class.parse('https://hitting.com/index')
|
|
30
|
+
|
|
31
|
+
expect(result[:subdomain]).to be_nil
|
|
32
|
+
expect(result[:root_domain]).to eq('hitting.com')
|
|
33
|
+
expect(result[:domain]).to eq('hitting')
|
|
34
|
+
expect(result[:tld]).to eq('com')
|
|
35
|
+
expect(result[:host]).to eq('hitting.com')
|
|
36
|
+
end
|
|
37
|
+
|
|
38
|
+
it 'parses domains with multi-part TLDs' do
|
|
39
|
+
result = described_class.parse('https://subdomain.example.co.uk')
|
|
40
|
+
|
|
41
|
+
expect(result[:subdomain]).to eq('subdomain')
|
|
42
|
+
expect(result[:root_domain]).to eq('example.co.uk')
|
|
43
|
+
expect(result[:domain]).to eq('example')
|
|
44
|
+
expect(result[:tld]).to eq('co.uk')
|
|
45
|
+
expect(result[:host]).to eq('subdomain.example.co.uk')
|
|
46
|
+
end
|
|
47
|
+
|
|
48
|
+
it 'parses domains with multiple subdomain levels' do
|
|
49
|
+
result = described_class.parse('https://api.staging.example.com')
|
|
50
|
+
|
|
51
|
+
expect(result[:subdomain]).to eq('api.staging')
|
|
52
|
+
expect(result[:root_domain]).to eq('example.com')
|
|
53
|
+
expect(result[:domain]).to eq('example')
|
|
54
|
+
expect(result[:tld]).to eq('com')
|
|
55
|
+
expect(result[:host]).to eq('api.staging.example.com')
|
|
56
|
+
end
|
|
57
|
+
|
|
58
|
+
it 'handles URLs with ports' do
|
|
59
|
+
result = described_class.parse('https://www.example.com:8080/path')
|
|
60
|
+
|
|
61
|
+
expect(result[:subdomain]).to eq('www')
|
|
62
|
+
expect(result[:root_domain]).to eq('example.com')
|
|
63
|
+
expect(result[:domain]).to eq('example')
|
|
64
|
+
expect(result[:tld]).to eq('com')
|
|
65
|
+
expect(result[:host]).to eq('www.example.com')
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
it 'handles URLs with query parameters' do
|
|
69
|
+
result = described_class.parse('https://example.com/page?param=value')
|
|
70
|
+
|
|
71
|
+
expect(result[:subdomain]).to be_nil
|
|
72
|
+
expect(result[:root_domain]).to eq('example.com')
|
|
73
|
+
expect(result[:domain]).to eq('example')
|
|
74
|
+
expect(result[:tld]).to eq('com')
|
|
75
|
+
expect(result[:host]).to eq('example.com')
|
|
76
|
+
expect(result[:path]).to eq('/page')
|
|
77
|
+
expect(result[:query_params]).to eq({ 'param' => 'value' })
|
|
78
|
+
end
|
|
79
|
+
|
|
80
|
+
it 'extracts path from URLs' do
|
|
81
|
+
result = described_class.parse('https://example.com/path/to/page')
|
|
82
|
+
|
|
83
|
+
expect(result[:path]).to eq('/path/to/page')
|
|
84
|
+
expect(result[:query_params]).to eq({})
|
|
85
|
+
end
|
|
86
|
+
|
|
87
|
+
it 'extracts multiple query parameters' do
|
|
88
|
+
result = described_class.parse('https://example.com/page?foo=bar&baz=qux&id=123')
|
|
89
|
+
|
|
90
|
+
expect(result[:query_params]).to eq({
|
|
91
|
+
'foo' => 'bar',
|
|
92
|
+
'baz' => 'qux',
|
|
93
|
+
'id' => '123'
|
|
94
|
+
})
|
|
95
|
+
end
|
|
96
|
+
|
|
97
|
+
it 'handles URLs with path and multiple query parameters' do
|
|
98
|
+
result = described_class.parse('https://api.example.com/v1/users?page=2&limit=10')
|
|
99
|
+
|
|
100
|
+
expect(result[:subdomain]).to eq('api')
|
|
101
|
+
expect(result[:root_domain]).to eq('example.com')
|
|
102
|
+
expect(result[:path]).to eq('/v1/users')
|
|
103
|
+
expect(result[:query_params]).to eq({
|
|
104
|
+
'page' => '2',
|
|
105
|
+
'limit' => '10'
|
|
106
|
+
})
|
|
107
|
+
end
|
|
108
|
+
|
|
109
|
+
it 'handles URLs with empty query string' do
|
|
110
|
+
result = described_class.parse('https://example.com/page?')
|
|
111
|
+
|
|
112
|
+
expect(result[:path]).to eq('/page')
|
|
113
|
+
expect(result[:query_params]).to eq({})
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
it 'handles URLs without path (root)' do
|
|
117
|
+
result = described_class.parse('https://example.com')
|
|
118
|
+
|
|
119
|
+
expect(result[:path]).to eq('')
|
|
120
|
+
expect(result[:query_params]).to eq({})
|
|
121
|
+
end
|
|
122
|
+
|
|
123
|
+
it 'handles query parameters with empty values' do
|
|
124
|
+
result = described_class.parse('https://example.com?key=')
|
|
125
|
+
|
|
126
|
+
expect(result[:query_params]).to eq({ 'key' => nil })
|
|
127
|
+
end
|
|
128
|
+
|
|
129
|
+
it 'handles query parameters without values' do
|
|
130
|
+
result = described_class.parse('https://example.com?flag')
|
|
131
|
+
|
|
132
|
+
expect(result[:query_params]).to eq({ 'flag' => nil })
|
|
133
|
+
end
|
|
134
|
+
|
|
135
|
+
it 'normalizes URLs without a scheme' do
|
|
136
|
+
result = described_class.parse('example.com/path?id=1')
|
|
137
|
+
|
|
138
|
+
expect(result[:root_domain]).to eq('example.com')
|
|
139
|
+
expect(result[:path]).to eq('/path')
|
|
140
|
+
expect(result[:query_params]).to eq({ 'id' => '1' })
|
|
141
|
+
end
|
|
142
|
+
end
|
|
143
|
+
|
|
144
|
+
context 'with invalid URLs' do
|
|
145
|
+
it 'returns nil for malformed URLs' do
|
|
146
|
+
expect(described_class.parse('http://')).to be_nil
|
|
147
|
+
end
|
|
148
|
+
|
|
149
|
+
it 'returns nil for invalid domains' do
|
|
150
|
+
expect(described_class.parse('not_a_url')).to be_nil
|
|
151
|
+
end
|
|
152
|
+
|
|
153
|
+
it 'returns nil for IP addresses' do
|
|
154
|
+
expect(described_class.parse('192.168.1.1')).to be_nil
|
|
155
|
+
end
|
|
156
|
+
|
|
157
|
+
it 'returns nil for IPv6 addresses' do
|
|
158
|
+
expect(described_class.parse('[2001:db8::1]')).to be_nil
|
|
159
|
+
end
|
|
160
|
+
|
|
161
|
+
it 'returns nil for empty string' do
|
|
162
|
+
expect(described_class.parse('')).to be_nil
|
|
163
|
+
end
|
|
164
|
+
|
|
165
|
+
it 'returns nil for nil' do
|
|
166
|
+
expect(described_class.parse(nil)).to be_nil
|
|
167
|
+
end
|
|
168
|
+
end
|
|
169
|
+
end
|
|
170
|
+
|
|
171
|
+
describe '.parse_query_params' do
|
|
172
|
+
it 'converts simple query string to hash' do
|
|
173
|
+
result = described_class.parse_query_params('foo=bar')
|
|
174
|
+
|
|
175
|
+
expect(result).to eq({ 'foo' => 'bar' })
|
|
176
|
+
end
|
|
177
|
+
|
|
178
|
+
it 'converts multiple parameters to hash' do
|
|
179
|
+
result = described_class.parse_query_params('foo=bar&baz=qux&id=123')
|
|
180
|
+
|
|
181
|
+
expect(result).to eq({
|
|
182
|
+
'foo' => 'bar',
|
|
183
|
+
'baz' => 'qux',
|
|
184
|
+
'id' => '123'
|
|
185
|
+
})
|
|
186
|
+
end
|
|
187
|
+
|
|
188
|
+
it 'returns empty hash for nil query' do
|
|
189
|
+
result = described_class.parse_query_params(nil)
|
|
190
|
+
|
|
191
|
+
expect(result).to eq({})
|
|
192
|
+
end
|
|
193
|
+
|
|
194
|
+
it 'returns empty hash for empty string query' do
|
|
195
|
+
result = described_class.parse_query_params('')
|
|
196
|
+
|
|
197
|
+
expect(result).to eq({})
|
|
198
|
+
end
|
|
199
|
+
|
|
200
|
+
it 'handles parameters with empty values' do
|
|
201
|
+
result = described_class.parse_query_params('key=')
|
|
202
|
+
|
|
203
|
+
expect(result).to eq({ 'key' => nil })
|
|
204
|
+
end
|
|
205
|
+
|
|
206
|
+
it 'handles parameters without values' do
|
|
207
|
+
result = described_class.parse_query_params('flag')
|
|
208
|
+
|
|
209
|
+
expect(result).to eq({ 'flag' => nil })
|
|
210
|
+
end
|
|
211
|
+
|
|
212
|
+
it 'handles mixed parameters with and without values' do
|
|
213
|
+
result = described_class.parse_query_params('foo=bar&flag&baz=qux')
|
|
214
|
+
|
|
215
|
+
expect(result).to eq({
|
|
216
|
+
'foo' => 'bar',
|
|
217
|
+
'flag' => nil,
|
|
218
|
+
'baz' => 'qux'
|
|
219
|
+
})
|
|
220
|
+
end
|
|
221
|
+
|
|
222
|
+
it 'ignores blank keys' do
|
|
223
|
+
result = described_class.parse_query_params('=value&foo=bar')
|
|
224
|
+
|
|
225
|
+
expect(result).to eq({ 'foo' => 'bar' })
|
|
226
|
+
end
|
|
227
|
+
end
|
|
228
|
+
|
|
229
|
+
describe '.parse_batch' do
|
|
230
|
+
it 'parses multiple URLs' do
|
|
231
|
+
urls = [
|
|
232
|
+
'dashtrack.com',
|
|
233
|
+
'www.insurancesite.ai',
|
|
234
|
+
'https://hitting.com/index',
|
|
235
|
+
'aninvalidurl',
|
|
236
|
+
''
|
|
237
|
+
]
|
|
238
|
+
|
|
239
|
+
results = described_class.parse_batch(urls)
|
|
240
|
+
|
|
241
|
+
expect(results[0][:root_domain]).to eq('dashtrack.com')
|
|
242
|
+
expect(results[0][:subdomain]).to be_nil
|
|
243
|
+
|
|
244
|
+
expect(results[1][:root_domain]).to eq('insurancesite.ai')
|
|
245
|
+
expect(results[1][:subdomain]).to eq('www')
|
|
246
|
+
|
|
247
|
+
expect(results[2][:root_domain]).to eq('hitting.com')
|
|
248
|
+
expect(results[2][:subdomain]).to be_nil
|
|
249
|
+
|
|
250
|
+
expect(results[3]).to be_nil
|
|
251
|
+
expect(results[4]).to be_nil
|
|
252
|
+
end
|
|
253
|
+
|
|
254
|
+
it 'handles all invalid URLs' do
|
|
255
|
+
results = described_class.parse_batch(['invalid', '', nil])
|
|
256
|
+
|
|
257
|
+
expect(results).to all(be_nil)
|
|
258
|
+
end
|
|
259
|
+
|
|
260
|
+
it 'handles all valid URLs' do
|
|
261
|
+
urls = ['example.com', 'www.example.com', 'api.example.com']
|
|
262
|
+
|
|
263
|
+
results = described_class.parse_batch(urls)
|
|
264
|
+
|
|
265
|
+
expect(results).to all(be_a(Hash))
|
|
266
|
+
expect(results.map { |result| result[:root_domain] }).to all(eq('example.com'))
|
|
267
|
+
end
|
|
268
|
+
|
|
269
|
+
it 'returns empty array for non-enumerable inputs' do
|
|
270
|
+
expect(described_class.parse_batch(nil)).to eq([])
|
|
271
|
+
end
|
|
272
|
+
end
|
|
273
|
+
end
|
data/spec/spec_helper.rb
ADDED
metadata
ADDED
|
@@ -0,0 +1,85 @@
|
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
|
2
|
+
name: domain_extractor
|
|
3
|
+
version: !ruby/object:Gem::Version
|
|
4
|
+
version: 0.1.0
|
|
5
|
+
platform: ruby
|
|
6
|
+
authors:
|
|
7
|
+
- OpenSite AI
|
|
8
|
+
bindir: bin
|
|
9
|
+
cert_chain: []
|
|
10
|
+
date: 1980-01-02 00:00:00.000000000 Z
|
|
11
|
+
dependencies:
|
|
12
|
+
- !ruby/object:Gem::Dependency
|
|
13
|
+
name: public_suffix
|
|
14
|
+
requirement: !ruby/object:Gem::Requirement
|
|
15
|
+
requirements:
|
|
16
|
+
- - "~>"
|
|
17
|
+
- !ruby/object:Gem::Version
|
|
18
|
+
version: '6.0'
|
|
19
|
+
type: :runtime
|
|
20
|
+
prerelease: false
|
|
21
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
22
|
+
requirements:
|
|
23
|
+
- - "~>"
|
|
24
|
+
- !ruby/object:Gem::Version
|
|
25
|
+
version: '6.0'
|
|
26
|
+
description: DomainExtractor is a high-performance url parser and domain parser for
|
|
27
|
+
Ruby. It delivers precise domain extraction, query parameter parsing, url normalization,
|
|
28
|
+
and multi-part tld parsing via public_suffix for web scraping and analytics workflows.
|
|
29
|
+
email: dev@opensite.ai
|
|
30
|
+
executables: []
|
|
31
|
+
extensions: []
|
|
32
|
+
extra_rdoc_files:
|
|
33
|
+
- CHANGELOG.md
|
|
34
|
+
- LICENSE.txt
|
|
35
|
+
- README.md
|
|
36
|
+
files:
|
|
37
|
+
- ".rubocop.yml"
|
|
38
|
+
- CHANGELOG.md
|
|
39
|
+
- LICENSE.txt
|
|
40
|
+
- README.md
|
|
41
|
+
- lib/domain_extractor.rb
|
|
42
|
+
- lib/domain_extractor/normalizer.rb
|
|
43
|
+
- lib/domain_extractor/parser.rb
|
|
44
|
+
- lib/domain_extractor/query_params.rb
|
|
45
|
+
- lib/domain_extractor/result.rb
|
|
46
|
+
- lib/domain_extractor/validators.rb
|
|
47
|
+
- lib/domain_extractor/version.rb
|
|
48
|
+
- spec/domain_extractor_spec.rb
|
|
49
|
+
- spec/spec_helper.rb
|
|
50
|
+
homepage: https://github.com/opensite-ai/domain_extractor
|
|
51
|
+
licenses:
|
|
52
|
+
- MIT
|
|
53
|
+
metadata:
|
|
54
|
+
source_code_uri: https://github.com/opensite-ai/domain_extractor
|
|
55
|
+
changelog_uri: https://github.com/opensite-ai/domain_extractor/blob/main/CHANGELOG.md
|
|
56
|
+
documentation_uri: https://rubydoc.info/gems/domain_extractor
|
|
57
|
+
bug_tracker_uri: https://github.com/opensite-ai/domain_extractor/issues
|
|
58
|
+
homepage_uri: https://opensite.ai
|
|
59
|
+
wiki_uri: https://docs.devguides.com/domain_extractor
|
|
60
|
+
rubygems_mfa_required: 'true'
|
|
61
|
+
allowed_push_host: https://rubygems.org
|
|
62
|
+
rdoc_options:
|
|
63
|
+
- "--main"
|
|
64
|
+
- README.md
|
|
65
|
+
- "--title"
|
|
66
|
+
- DomainExtractor - URL Domain Component Extractor
|
|
67
|
+
- "--line-numbers"
|
|
68
|
+
- "--inline-source"
|
|
69
|
+
require_paths:
|
|
70
|
+
- lib
|
|
71
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
|
72
|
+
requirements:
|
|
73
|
+
- - ">="
|
|
74
|
+
- !ruby/object:Gem::Version
|
|
75
|
+
version: 2.7.0
|
|
76
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
|
77
|
+
requirements:
|
|
78
|
+
- - ">="
|
|
79
|
+
- !ruby/object:Gem::Version
|
|
80
|
+
version: '0'
|
|
81
|
+
requirements: []
|
|
82
|
+
rubygems_version: 3.7.2
|
|
83
|
+
specification_version: 4
|
|
84
|
+
summary: High-performance url parser and domain extractor for Ruby
|
|
85
|
+
test_files: []
|