domain_extractor 0.2.7 → 0.2.9

data/README.md CHANGED
@@ -1,8 +1,9 @@
+ ![Domain Extractor and Parsing Ruby Gem](https://github.com/user-attachments/assets/b3fbe605-3a3e-4d45-b0ed-b061a807d61b)
+
  # DomainExtractor
 
  [![Gem Version](https://badge.fury.io/rb/domain_extractor.svg?v=020)](https://badge.fury.io/rb/domain_extractor)
  [![CI](https://github.com/opensite-ai/domain_extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/opensite-ai/domain_extractor/actions/workflows/ci.yml)
- [![Code Climate](https://codeclimate.com/github/opensite-ai/domain_extractor/badges/gpa.svg)](https://codeclimate.com/github/opensite-ai/domain_extractor)
 
  A lightweight, robust Ruby library for url parsing and domain parsing with **accurate multi-part TLD support**. DomainExtractor delivers a high-throughput url parser and domain parser that excels at domain extraction tasks while staying friendly to analytics pipelines. Perfect for web scraping, analytics, url manipulation, query parameter parsing, and multi-environment domain analysis.
 
@@ -10,16 +11,20 @@ Use **DomainExtractor** whenever you need a dependable tld parser for tricky mul
 
  ## Why DomainExtractor?
 
+ ✅ **URI-Compatible Accessors** - Covers common absolute-URL workflows with a Ruby `URI`-style API
+ ✅ **Authentication Extraction** - Parse credentials from Redis, database, FTP, and API URLs
  ✅ **Accurate Multi-part TLD Parser** - Handles complex multi-part TLDs (co.uk, com.au, gov.br) using the [Public Suffix List](https://publicsuffix.org/)
  ✅ **Nested Subdomain Extraction** - Correctly parses multi-level subdomains (api.staging.example.com)
  ✅ **Smart URL Normalization** - Automatically handles URLs with or without schemes
  ✅ **Powerful URL Formatting** - Transform and standardize URLs with flexible options
  ✅ **Rails Integration** - Custom ActiveModel validator for declarative URL validation
  ✅ **Query Parameter Parsing** - Parse query strings into structured hashes
+ ✅ **Authentication Helpers** - Generate Basic Auth and Bearer token headers
  ✅ **Batch Processing** - Parse multiple URLs efficiently
  ✅ **IP Address Detection** - Identifies and handles IPv4 and IPv6 addresses
+ ✅ **Benchmark-Backed Performance** - Auth helpers run in low microseconds; full parses are documented in the included benchmark suite
  ✅ **Zero Configuration** - Works out of the box with sensible defaults
- ✅ **Well-Tested** - Comprehensive test suite covering edge cases
+ ✅ **Well-Tested** - 200+ comprehensive test cases covering all scenarios
 
  ## Installation
 
@@ -370,6 +375,235 @@ DomainExtractor.format(url_string, **options)
  # :use_trailing_slash (true/false)
  ```
 
+ ## Authentication Extraction
+
+ DomainExtractor provides comprehensive authentication extraction from URLs, supporting all major database systems, caching solutions, and file transfer protocols.
+
+ ### Supported URL Schemes
+
+ **Database Connections:**
+ - PostgreSQL: `postgresql://user:pass@host:5432/dbname`
+ - MySQL: `mysql://user:pass@host:3306/database`
+ - MongoDB: `mongodb+srv://user:pass@cluster.mongodb.net/db`
+ - CockroachDB: `postgresql://user:pass@host:26257/db`
+
+ **Caching & Message Queues:**
+ - Redis: `redis://user:pass@host:6379/0`
+ - Redis SSL: `rediss://:password@host:6380`
+
+ **File Transfer:**
+ - FTP: `ftp://user:pass@host/path`
+ - SFTP: `sftp://user:pass@host:22/path`
+ - FTPS: `ftps://user:pass@host:990/path`
+
+ **HTTP/HTTPS:**
+ - Basic Auth: `https://user:pass@api.example.com`
+
+ ### Basic Usage
+
+ ```ruby
+ # Parse Redis URL
+ redis_url = 'rediss://default:my_secret_pw@redis.cloud:6385/0'
+ result = DomainExtractor.parse(redis_url)
+
+ result.scheme # => "rediss"
+ result.user # => "default"
+ result.password # => "my_secret_pw"
+ result.host # => "redis.cloud"
+ result.port # => 6385
+ result.path # => "/0"
+
+ # Parse PostgreSQL URL
+ db_url = 'postgresql://appuser:SecurePass@db.prod.internal:5432/production'
+ result = DomainExtractor.parse(db_url)
+
+ result.user # => "appuser"
+ result.password # => "SecurePass"
+ result.host # => "db.prod.internal"
+ result.port # => 5432
+ result.path # => "/production"
+ ```
+
+ ### Special Character Handling
+
+ DomainExtractor automatically handles percent-encoded special characters in credentials:
+
+ ```ruby
+ # Password with special characters: P@ss:word!
+ url = 'redis://user:P%40ss%3Aword%21@localhost:6379'
+ result = DomainExtractor.parse(url)
+
+ result.password # => "P%40ss%3Aword%21" (encoded)
+ result.decoded_password # => "P@ss:word!" (decoded, ready to use)
+
+ # Username as email address
+ url = 'https://user%40domain.com:password@api.example.com'
+ result = DomainExtractor.parse(url)
+
+ result.user # => "user%40domain.com"
+ result.decoded_user # => "user@domain.com"
+ ```
+
+ ### Authentication Helper Methods
+
+ **Generate Basic Authentication Headers:**
+
+ ```ruby
+ # From parsed URL
+ result = DomainExtractor.parse('https://user:pass@api.example.com')
+ auth_header = result.basic_auth_header
+ # => "Basic dXNlcjpwYXNz"
+
+ # Use in HTTP request
+ require 'net/http'
+ uri = URI('https://api.example.com/endpoint')
+ request = Net::HTTP::Get.new(uri)
+ request['Authorization'] = auth_header
+
+ # Or use module method directly
+ header = DomainExtractor.basic_auth_header('username', 'password')
+ # => "Basic dXNlcm5hbWU6cGFzc3dvcmQ="
+ ```
+
+ **Generate Bearer Token Headers:**
+
+ ```ruby
+ token = 'eyJhbGciOiJIUzI1NiIs...'
+ header = DomainExtractor.bearer_auth_header(token)
+ # => "Bearer eyJhbGciOiJIUzI1NiIs..."
+
+ # Use in API request
+ request['Authorization'] = header
+ ```
+
+ **Encode/Decode Credentials:**
+
+ ```ruby
+ # Encode credentials for URL use
+ password = 'P@ss:word!'
+ encoded = DomainExtractor.encode_credential(password)
+ # => "P%40ss%3Aword%21"
+
+ # Build URL with encoded credentials
+ url = "redis://user:#{encoded}@localhost:6379"
+
+ # Decode credentials
+ decoded = DomainExtractor.decode_credential(encoded)
+ # => "P@ss:word!"
+ ```
+
+ ### Real-World Examples
+
+ **Database Connection Configuration:**
+
+ ```ruby
+ class DatabaseConfig
+   def self.from_url(url)
+     config = DomainExtractor.parse(url)
+
+     {
+       adapter: config.scheme.sub('postgresql', 'postgres'),
+       host: config.host,
+       port: config.port,
+       database: config.path&.sub('/', ''),
+       username: config.decoded_user,
+       password: config.decoded_password
+     }
+   end
+ end
+
+ # Usage
+ db_url = ENV['DATABASE_URL']
+ config = DatabaseConfig.from_url(db_url)
+ # => { adapter: "postgres", host: "db.prod.internal", port: 5432, ... }
+ ```
+
+ **Redis Connection Helper:**
+
+ ```ruby
+ class RedisConnection
+   def self.from_url(url)
+     config = DomainExtractor.parse(url)
+
+     Redis.new(
+       host: config.host,
+       port: config.port || 6379,
+       password: config.decoded_password,
+       db: config.path&.sub('/', '')&.to_i || 0,
+       ssl: config.scheme == 'rediss'
+     )
+   end
+ end
+
+ # Usage
+ redis = RedisConnection.from_url(ENV['REDIS_URL'])
+ ```
+
+ **SFTP Deployment Script:**
+
+ ```ruby
+ def deploy_via_sftp(url, local_path)
+   config = DomainExtractor.parse(url)
+
+   Net::SFTP.start(
+     config.host,
+     config.decoded_user,
+     password: config.decoded_password,
+     port: config.port || 22
+   ) do |sftp|
+     sftp.upload!(local_path, config.path)
+   end
+ end
+
+ # Usage
+ deploy_via_sftp(ENV['DEPLOY_URL'], './build')
+ ```
+
+ ### Security Best Practices
+
+ ⚠️ **Important Security Considerations:**
+
+ 1. **Never hardcode credentials in source code**
+    ```ruby
+    # ❌ Bad
+    url = 'redis://user:password@localhost:6379'
+
+    # ✅ Good
+    url = ENV['REDIS_URL']
+    url = Rails.application.credentials.redis[:url]
+    ```
+
+ 2. **Use environment variables or secret managers**
+    ```ruby
+    # ✅ Good
+    db_config = DomainExtractor.parse(ENV['DATABASE_URL'])
+    redis_config = DomainExtractor.parse(ENV['REDIS_URL'])
+    ```
+
+ 3. **Never log URLs with credentials**
+    ```ruby
+    # ❌ Bad
+    logger.info("Connecting to #{database_url}")
+
+    # ✅ Good
+    config = DomainExtractor.parse(database_url)
+    logger.info("Connecting to #{config.host}:#{config.port}")
+    ```
+
+ 4. **Always use TLS/SSL for credential transmission**
+    ```ruby
+    # ✅ Good - Use rediss:// not redis://
+    url = 'rediss://user:pass@redis.cloud:6380'
+
+    # ✅ Good - Use postgresql:// with sslmode
+    url = 'postgresql://user:pass@db.cloud:5432/db?sslmode=require'
+    ```
+
+ 5. **Rotate credentials regularly**
+    - Use secret rotation services (AWS Secrets Manager, HashiCorp Vault)
+    - Never commit credentials to version control
+    - Use `.env` files with `.gitignore`
+
  ## URL Formatting
 
  DomainExtractor provides powerful URL formatting capabilities to normalize, transform, and standardize URLs according to your application's requirements.
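Reviewer note on the hunk above: the expected values in the new "Authentication Helper Methods" examples can be cross-checked with the Ruby standard library alone. The sketch below is not the gem's implementation; the `sketch_*` method names are hypothetical, and it assumes the helpers follow RFC 7617 Basic auth (Base64 of `user:password`) and plain percent-encoding of reserved characters.

```ruby
require 'erb'
require 'uri'

# Hypothetical stand-in for a Basic auth helper: Base64("user:password")
# with no trailing newline ('m0' is strict Base64 in Array#pack).
def sketch_basic_auth_header(user, password)
  "Basic #{["#{user}:#{password}"].pack('m0')}"
end

# Hypothetical stand-in for credential encoding: percent-encode everything
# except unreserved characters (letters, digits, '-', '_', '.', '~').
def sketch_encode_credential(value)
  ERB::Util.url_encode(value)
end

# Hypothetical stand-in for credential decoding: turn %XX escapes back into
# characters, leaving a literal '+' untouched.
def sketch_decode_credential(value)
  URI::RFC2396_Parser.new.unescape(value)
end

puts sketch_basic_auth_header('user', 'pass')         # Basic dXNlcjpwYXNz
puts sketch_basic_auth_header('username', 'password') # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
puts sketch_encode_credential('P@ss:word!')           # P%40ss%3Aword%21
puts sketch_decode_credential('P%40ss%3Aword%21')     # P@ss:word!
```

The printed values match the `# =>` comments in the README hunk, so the documented outputs are at least consistent with standard Base64 and percent-encoding.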
@@ -861,6 +1095,155 @@ secure.errors[:url]
  # => ["must use https://"]
  ```
 
+ ## URI-Compatible Access
+
+ DomainExtractor covers the most common absolute-URL workflows people reach for Ruby's `URI` library for, while adding domain extraction, auth helpers, and formatting utilities.
+
+ ### Why Replace URI?
+
+ **Performance:**
+ - Included benchmarks measure roughly 5k-6k full parses/sec for common URLs on Ruby 3.4 / Apple Silicon
+ - Auth helper methods remain microsecond-level operations
+ - Domain extraction work happens in the same parse pass
+
+ **Features:**
+ - Common absolute-URL component accessors and setters
+ - PLUS: Multi-part TLD parsing
+ - PLUS: Domain component extraction
+ - PLUS: Decoded credentials
+ - PLUS: Authentication helpers
+ - PLUS: URL formatting
+
+ ### Migration from URI
+
+ **Low-friction migration for common absolute-URL use cases:**
+
+ ```ruby
+ # Before (using URI)
+ require 'uri'
+
+ uri = URI.parse('https://user:pass@example.com:8080/path?query=value#section')
+ uri.scheme # => "https"
+ uri.user # => "user"
+ uri.password # => "pass"
+ uri.host # => "example.com"
+ uri.port # => 8080
+ uri.path # => "/path"
+ uri.query # => "query=value"
+ uri.fragment # => "section"
+
+ # After (using DomainExtractor) - URI-style access plus domain helpers
+ require 'domain_extractor'
+
+ result = DomainExtractor.parse('https://user:pass@example.com:8080/path?query=value#section')
+ result.scheme # => "https"
+ result.user # => "user"
+ result.password # => "pass"
+ result.host # => "example.com"
+ result.port # => 8080
+ result.path # => "/path"
+ result.query # => "query=value"
+ result.fragment # => "section"
+
+ # PLUS: Additional features not in URI along with each method
+ # also having `?` and `!` variants for custom behavior
+ result.subdomain # => nil
+ result.domain # => "example"
+ result.tld # => "com"
+ result.root_domain # => "example.com"
+ result.decoded_user # => "user"
+ result.decoded_password # => "pass"
+ result.basic_auth_header # => "Basic dXNlcjpwYXNz"
+ ```
+
+ ### URI Method Compatibility
+
+ **Common absolute-URL URI methods are supported:**
+
+ ```ruby
+ result = DomainExtractor.parse('https://api.example.com:8443/v1/users?page=2#results')
+
+ # Component accessors
+ result.scheme # => "https"
+ result.host # => "api.example.com"
+ result.hostname # => "api.example.com"
+ result.port # => 8443
+ result.path # => "/v1/users"
+ result.query # => "page=2"
+ result.fragment # => "results"
+
+ # Authentication
+ result.user # => nil
+ result.password # => nil
+ result.userinfo # => nil
+
+ # URI state checks
+ result.absolute? # => true
+ result.relative? # => false # bare hosts are normalized to https://
+
+ # Default ports
+ result.default_port # => 443 (for https)
+
+ # String conversion
+ result.to_s # => Full URL string
+ result.to_str # => Alias for to_s
+ result.to_h # => Hash representation
+ ```
+
+ ### Advanced URI Features
+
+ **Proxy Detection:**
+
+ ```ruby
+ # Automatically detects proxy from environment
+ # Checks http_proxy, HTTP_PROXY, and no_proxy
+ result = DomainExtractor.parse('https://api.example.com')
+ proxy = result.find_proxy
+ # => #<URI::HTTP http://proxy.company.com:8080> or nil
+ ```
+
+ **URI Normalization:**
+
+ ```ruby
+ result = DomainExtractor.parse('HTTP://EXAMPLE.COM:80/Path')
+ normalized = result.normalize
+
+ normalized.scheme # => "http" (lowercased)
+ normalized.host # => "example.com" (lowercased)
+ normalized.port # => 80 (URI-compatible default port)
+ normalized.to_s # => "http://example.com/Path"
+ ```
+
+ **URI Merging:**
+
+ ```ruby
+ base = DomainExtractor.parse('https://example.com/api/v1/')
+ relative = 'users/123'
+
+ merged = base.merge(relative)
+ merged.to_s # => "https://example.com/api/v1/users/123"
+ ```
+
+ ### Component Setters
+
+ **Modify URI components programmatically:**
+
+ ```ruby
+ result = DomainExtractor.parse('http://example.com')
+
+ # Set individual components
+ result.scheme = 'https'
+ result.host = 'secure.example.com'
+ result.port = 8443
+ result.path = '/api/endpoint'
+ result.query = 'key=value'
+ result.fragment = 'section'
+
+ # Build complete URL
+ result.build_url
+ # => "https://secure.example.com:8443/api/endpoint?key=value#section"
+ ```
+
  ## Use Cases
 
  **Web Scraping**
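Reviewer note on the hunk above: the new "URI Normalization" and "URI Merging" examples claim parity with Ruby's `URI`. As a baseline, here is what the standard library itself returns for the same inputs. This only exercises stdlib `URI`; whether DomainExtractor matches it in every case is the README's claim and is not verified here.

```ruby
require 'uri'

# Normalization: the stdlib lowercases scheme and host and omits the
# scheme's default port when rendering the string.
normalized = URI.parse('HTTP://EXAMPLE.COM:80/Path').normalize
puts normalized.scheme # http
puts normalized.host   # example.com
puts normalized.port   # 80
puts normalized.to_s   # http://example.com/Path

# Merging: a relative reference is resolved against the base path.
# The trailing slash on the base matters; without it "v1" would be replaced.
merged = URI.parse('https://example.com/api/v1/').merge('users/123')
puts merged.to_s # https://example.com/api/v1/users/123

no_slash = URI.parse('https://example.com/api/v1').merge('users/123')
puts no_slash.to_s # https://example.com/api/users/123
```

The last case is worth a test on the DomainExtractor side when migrating, since base URLs without a trailing slash are a common source of surprises with relative merges.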
@@ -892,8 +1275,8 @@ end
 
  Optimized for high-throughput production use:
 
- - **Single URL parsing**: 15-30μs per URL (50,000+ URLs/second)
- - **Batch processing**: 50,000+ URLs/second sustained throughput
+ - **Single URL parsing**: the included benchmarks currently land around 170-280μs for common absolute URLs on Ruby 3.4 / Apple Silicon
+ - **Batch processing**: the included benchmarks currently land around 5k-6k URLs/sec for common workloads, with larger batches becoming allocation-bound
  - **Memory efficient**: <100KB overhead, ~200 bytes per parse
  - **Thread-safe**: Stateless modules, safe for concurrent use
  - **Zero-allocation hot paths**: Frozen constants, pre-compiled regex
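Reviewer note on the hunk above: the revised figures quote latency (170-280μs per parse) and throughput (5k-6k URLs/sec) separately. Converting one to the other shows they agree at the fast end of the latency range. The sketch below does that arithmetic and adds a small timing harness; the method names are hypothetical, and stdlib `URI.parse` is used only as a stand-in workload, not as a measurement of DomainExtractor.

```ruby
require 'uri'

# Convert a per-operation latency in microseconds to operations per second.
def ops_per_second(microseconds_per_op)
  1_000_000.0 / microseconds_per_op
end

# Time a block over every input for a number of rounds and report the
# observed operations per second, using the monotonic clock.
def measure_ops_per_second(inputs, rounds: 200)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  rounds.times { inputs.each { |input| yield input } }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  (rounds * inputs.size) / elapsed
end

puts ops_per_second(170).round # 5882
puts ops_per_second(280).round # 3571

urls = [
  'https://user:pass@example.com:8080/path?query=value#section',
  'https://api.example.com:8443/v1/users?page=2#results'
]
rate = measure_ops_per_second(urls) { |url| URI.parse(url) }
puts "stdlib URI.parse on this machine: #{rate.round} parses/sec"
```

So 170μs corresponds to roughly 5.9k parses/sec and 280μs to roughly 3.6k parses/sec. To reproduce the README's numbers, the same harness could be pointed at `DomainExtractor.parse` with the gem installed.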
@@ -907,8 +1290,15 @@ View [performance analysis](https://github.com/opensite-ai/domain_extractor/blob
  | Multi-part TLD parser | ✅ | ❌ | ❌ |
  | Subdomain extraction | ✅ | ❌ | ❌ |
  | Domain component separation | ✅ | ❌ | ❌ |
- | Built-in url normalization | ✅ | ❌ | |
+ | Auth extraction & decoding | ✅ | ❌ | ⚠️ (basic) |
+ | Authentication helpers | ✅ | ❌ | ❌ |
+ | Built-in url normalization | ✅ | ❌ | ✅ |
+ | URL formatting | ✅ | ❌ | ❌ |
+ | Proxy detection | ✅ | ❌ | ✅ |
+ | Performance profile | Feature-rich single-pass parse | Varies | Faster raw parse |
+ | Auth helper speed | Microsecond-level | ❌ | ❌ |
  | Lightweight | ✅ | ❌ | ✅ |
+ | Rails validator | ✅ | ❌ | ❌ |
 
  ## Requirements
 
@@ -0,0 +1,82 @@
+ # frozen_string_literal: true
+
+ require 'uri'
+
+ module DomainExtractor
+   # Auth module extracts authentication components from URIs
+   # Handles userinfo parsing with support for special characters and percent-encoding
+   module Auth
+     # Frozen constants for zero allocation
+     COLON = ':'
+     EMPTY_AUTH = {
+       user: nil,
+       password: nil,
+       userinfo: nil,
+       decoded_user: nil,
+       decoded_password: nil
+     }.freeze
+
+     module_function
+
+     # Extract userinfo components from a URI object
+     # @param uri [URI::Generic] The parsed URI object
+     # @return [Hash] Hash containing :user, :password, :userinfo, :decoded_user, :decoded_password
+     def extract(uri)
+       return empty_auth unless uri&.userinfo
+
+       user, password = split_userinfo(uri.userinfo)
+
+       {
+         user: user,
+         password: password,
+         userinfo: uri.userinfo,
+         decoded_user: decode_component(user),
+         decoded_password: decode_component(password)
+       }
+     end
+
+     # Split userinfo into user and password components
+     # Handles edge cases like password-only (":password") and user-only ("user")
+     # @param userinfo [String] The userinfo string from URI
+     # @return [Array<String, String>] Array of [user, password]
+     def split_userinfo(userinfo)
+       return [nil, nil] if userinfo.nil? || userinfo.empty?
+
+       # Find first colon to split user:password
+       colon_index = userinfo.index(COLON)
+
+       if colon_index.nil?
+         # No colon means user-only
+         [userinfo, nil]
+       elsif colon_index.zero?
+         # Starts with colon means password-only (Redis pattern: ":password")
+         [nil, userinfo[1..]]
+       else
+         # Normal case: "user:password"
+         [userinfo[0...colon_index], userinfo[(colon_index + 1)..]]
+       end
+     end
+     private_class_method :split_userinfo
+
+     # Decode percent-encoded component
+     # @param component [String, nil] The component to decode
+     # @return [String, nil] Decoded component or nil
+     def decode_component(component)
+       return nil if component.nil?
+       return component if component.empty?
+
+       URI::DEFAULT_PARSER.unescape(component)
+     rescue StandardError
+       # If decoding fails, return original
+       component
+     end
+     private_class_method :decode_component
+
+     # Return empty auth hash
+     # @return [Hash] Empty auth components
+     def empty_auth
+       EMPTY_AUTH
+     end
+     private_class_method :empty_auth
+   end
+ end
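Reviewer note on the new file above: `split_userinfo` is marked private, so its edge cases cannot be called directly from outside the module. The standalone sketch below restates the same splitting rules with `String#partition` instead of `index` plus slicing, as a way to spell out the expected behaviour. The method name `sketch_split_userinfo` is hypothetical and is not part of the gem.

```ruby
# Split "user:password" on the FIRST colon only, so colons inside the
# password are preserved. Returns [user, password], where either may be nil.
def sketch_split_userinfo(userinfo)
  return [nil, nil] if userinfo.nil? || userinfo.empty?

  user, colon, password = userinfo.partition(':')
  [
    user.empty? ? nil : user,      # ":secret" has no user part
    colon.empty? ? nil : password  # "user" has no colon, so no password
  ]
end

p sketch_split_userinfo('user:pass')  # ["user", "pass"]
p sketch_split_userinfo('user')       # ["user", nil]
p sketch_split_userinfo(':secret')    # [nil, "secret"]
p sketch_split_userinfo('user:pa:ss') # ["user", "pa:ss"]
p sketch_split_userinfo('user:')      # ["user", ""]
p sketch_split_userinfo(nil)          # [nil, nil]
```

Tracing the diff's `index`-based branches by hand gives the same results for these inputs, including the empty-string password for `"user:"`. One small observation: the `@return [Array<String, String>]` tag in the diff does not mention that either element can be `nil`.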