UrlCategorise 0.1.3 → 0.1.7
- checksums.yaml +4 -4
- data/.claude/settings.local.json +8 -2
- data/.gitignore +2 -0
- data/CLAUDE.md +140 -2
- data/Gemfile.lock +17 -1
- data/README.md +450 -7
- data/bin/export_csv +120 -0
- data/bin/export_hosts +68 -0
- data/bin/generate_video_lists +373 -0
- data/bin/rake +2 -0
- data/correct_usage_example.rb +64 -0
- data/docs/v0.1.4-features.md +215 -0
- data/docs/video-url-detection.md +353 -0
- data/lib/url_categorise/active_record_client.rb +1 -1
- data/lib/url_categorise/client.rb +699 -39
- data/lib/url_categorise/constants.rb +9 -6
- data/lib/url_categorise/dataset_processor.rb +27 -10
- data/lib/url_categorise/iab_compliance.rb +149 -0
- data/lib/url_categorise/version.rb +1 -1
- data/lib/url_categorise.rb +2 -0
- data/lists/video_hosting_domains.hosts +7057 -0
- data/lists/video_url_patterns.txt +297 -0
- data/url_categorise.gemspec +5 -2
- metadata +70 -3
@@ -0,0 +1,215 @@
|
|
1
|
+
# UrlCategorise v0.1.4 Release Notes
|
2
|
+
|
3
|
+
## New Features Added
|
4
|
+
|
5
|
+
### Dataset-Specific Helper Methods
|
6
|
+
|
7
|
+
Added dedicated helper methods for tracking dataset-only statistics separately from DNS blocklists:
|
8
|
+
|
9
|
+
```ruby
|
10
|
+
client = UrlCategorise::Client.new(dataset_config: { kaggle: {} })
|
11
|
+
client.load_kaggle_dataset('owner', 'dataset-name')
|
12
|
+
|
13
|
+
# New dataset-specific methods
|
14
|
+
client.count_of_dataset_hosts # Returns hosts from datasets only
|
15
|
+
client.count_of_dataset_categories # Returns categories from datasets only
|
16
|
+
|
17
|
+
# Existing methods still work and count both
|
18
|
+
client.count_of_hosts # All hosts (DNS lists + datasets)
|
19
|
+
client.count_of_categories # All categories (DNS lists + datasets)
|
20
|
+
```
|
21
|
+
|
22
|
+
**Implementation Details:**
|
23
|
+
- `count_of_dataset_hosts` - Sums hosts from categories tracked in `@dataset_categories` Set
|
24
|
+
- `count_of_dataset_categories` - Returns size of `@dataset_categories` Set
|
25
|
+
- Gracefully handles nil categories with safe navigation (`&.size || 0`)
|
26
|
+
- Both methods return 0 when no datasets are loaded
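
The counting logic described above can be sketched as follows. This is a minimal illustration, assuming the client keeps a hash of category to host list plus a Set of dataset category names; `DatasetCounterSketch` and its constructor are hypothetical stand-ins, not the gem's actual internals.

```ruby
require 'set'

# Hypothetical stand-in for the client's internal state (not the gem's class).
class DatasetCounterSketch
  # hosts: { category => [host, ...] or nil }
  # dataset_categories: names of categories that came from datasets
  def initialize(hosts, dataset_categories)
    @hosts = hosts
    @dataset_categories = Set.new(dataset_categories)
  end

  # Sums hosts only for dataset-tracked categories; a nil or missing
  # host list counts as 0 thanks to safe navigation.
  def count_of_dataset_hosts
    @dataset_categories.sum { |category| @hosts[category]&.size || 0 }
  end

  # Number of categories that originated from datasets.
  def count_of_dataset_categories
    @dataset_categories.size
  end
end
```

With an empty Set both methods return 0, which matches the "no datasets loaded" behaviour listed above.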
|
27
|
+
|
28
|
+
### IAB Content Taxonomy Compliance
|
29
|
+
|
30
|
+
Full support for IAB (Interactive Advertising Bureau) Content Taxonomy standards:
|
31
|
+
|
32
|
+
```ruby
|
33
|
+
# Enable IAB v3.0 compliance (recommended)
|
34
|
+
client = UrlCategorise::Client.new(
|
35
|
+
iab_compliance: true,
|
36
|
+
iab_version: :v3
|
37
|
+
)
|
38
|
+
|
39
|
+
# Enable IAB v2.0 compliance
|
40
|
+
client = UrlCategorise::Client.new(
|
41
|
+
iab_compliance: true,
|
42
|
+
iab_version: :v2
|
43
|
+
)
|
44
|
+
|
45
|
+
# Categorization returns IAB codes instead of custom categories
|
46
|
+
categories = client.categorise("badsite.com")
|
47
|
+
puts categories # => ["626"] (IAB v3 code for illegal content)
|
48
|
+
|
49
|
+
# Helper methods
|
50
|
+
client.iab_compliant? # => true/false
|
51
|
+
client.get_iab_mapping(:malware) # => "626" (v3) or "IAB25" (v2)
|
52
|
+
```
|
53
|
+
|
54
|
+
**IAB Category Mappings:**
|
55
|
+
|
56
|
+
**IAB Content Taxonomy v3.0:**
|
57
|
+
- Security threats (`malware`, `phishing`, `illegal`) → `626` (Illegal Content)
|
58
|
+
- Advertising (`advertising`, `mobile_ads`) → `3` (Advertising)
|
59
|
+
- Gambling → `7-39` (Gambling subcategory)
|
60
|
+
- Adult content (`pornography`) → `626` (Adult Content)
|
61
|
+
- Social platforms → `14` (Society)
|
62
|
+
- Technology → `19` (Technology & Computing)
|
63
|
+
|
64
|
+
**IAB Content Taxonomy v2.0:**
|
65
|
+
- Security threats → `IAB25` (Non-Standard Content)
|
66
|
+
- Advertising → `IAB3` (Advertising)
|
67
|
+
- Gambling → `IAB7-39` (Gambling subcategory)
|
68
|
+
- Adult content → `IAB25-3` (Pornography)
|
69
|
+
|
70
|
+
**Implementation Details:**
|
71
|
+
- New `IabCompliance` module with comprehensive mappings
|
72
|
+
- Support for both v2.0 and v3.0 standards
|
73
|
+
- IAB compliance affects all categorization methods (`categorise`, `categorise_ip`, `resolve_and_categorise`)
|
74
|
+
- Automatic deduplication of IAB codes
|
75
|
+
- Graceful handling of unknown categories (returns 'Unknown')
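
The mapping, deduplication, and unknown-category behaviour listed above could look roughly like this. The sketch only uses the codes quoted in these notes; `IabMappingSketch`, its tables, and `convert_categories` are hypothetical names, and the per-category v2 assignments are an assumption derived from the grouped v2 list.

```ruby
# Hypothetical illustration of an IAB mapping module (not the gem's IabCompliance).
module IabMappingSketch
  V3 = {
    malware: '626', phishing: '626', illegal: '626',
    advertising: '3', mobile_ads: '3',
    gambling: '7-39', pornography: '626'
  }.freeze

  # Assumed expansion of the grouped v2 list above.
  V2 = {
    malware: 'IAB25', phishing: 'IAB25', illegal: 'IAB25',
    advertising: 'IAB3', mobile_ads: 'IAB3',
    gambling: 'IAB7-39', pornography: 'IAB25-3'
  }.freeze

  # Returns the IAB code for one category, or 'Unknown' when unmapped.
  def self.get_iab_mapping(category, version = :v3)
    table = version == :v2 ? V2 : V3
    table.fetch(category.to_sym, 'Unknown')
  end

  # Maps a list of categories and removes duplicate codes.
  def self.convert_categories(categories, version = :v3)
    categories.map { |category| get_iab_mapping(category, version) }.uniq
  end
end
```

Because several custom categories share one IAB code, the `uniq` step is what keeps a domain tagged as both `malware` and `phishing` from reporting `626` twice.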
|
76
|
+
|
77
|
+
### Comprehensive Test Coverage
|
78
|
+
|
79
|
+
Added extensive test suites for new features:
|
80
|
+
|
81
|
+
**Dataset Methods Tests:**
|
82
|
+
- `client_dataset_methods_test.rb` (9 tests)
|
83
|
+
- Tests for empty, populated, and mixed dataset scenarios
|
84
|
+
- Integration tests with existing methods
|
85
|
+
- Edge case handling (nil categories, empty arrays)
|
86
|
+
|
87
|
+
**IAB Compliance Tests:**
|
88
|
+
- `iab_compliance_test.rb` (14 tests) - Module functionality
|
89
|
+
- `client_iab_compliance_test.rb` (19 tests) - Client integration
|
90
|
+
- Comprehensive mapping validation for v2 and v3
|
91
|
+
- Integration with all categorization methods
|
92
|
+
- DNS resolution with IAB compliance
|
93
|
+
- Error handling and edge cases
|
94
|
+
|
95
|
+
**Total Test Stats:**
|
96
|
+
- **316 tests, 2455 assertions, 0 failures, 0 errors**
|
97
|
+
- **94.69% line coverage** (660/697 lines)
|
98
|
+
- All new features fully tested with edge cases
|
99
|
+
|
100
|
+
### Technical Implementation
|
101
|
+
|
102
|
+
**New Files:**
|
103
|
+
- `lib/url_categorise/iab_compliance.rb` - IAB mapping module
|
104
|
+
- `test/url_categorise/client_dataset_methods_test.rb` - Dataset helper tests
|
105
|
+
- `test/url_categorise/iab_compliance_test.rb` - IAB module tests
|
106
|
+
- `test/url_categorise/client_iab_compliance_test.rb` - IAB client integration tests
|
107
|
+
|
108
|
+
**Updated Files:**
|
109
|
+
- `lib/url_categorise/client.rb` - Added dataset helpers and IAB support
|
110
|
+
- `lib/url_categorise.rb` - Required new IAB module
|
111
|
+
- `lib/url_categorise/version.rb` - Bumped to 0.1.4
|
112
|
+
|
113
|
+
**New Client Attributes:**
|
114
|
+
- `iab_compliance_enabled` - Boolean flag for IAB compliance
|
115
|
+
- `iab_version` - IAB taxonomy version (:v2 or :v3)
|
116
|
+
|
117
|
+
**New Client Methods:**
|
118
|
+
- `count_of_dataset_hosts` - Dataset-specific host count
|
119
|
+
- `count_of_dataset_categories` - Dataset-specific category count
|
120
|
+
- `iab_compliant?` - Check IAB compliance status
|
121
|
+
- `get_iab_mapping(category)` - Get IAB code for category
|
122
|
+
|
123
|
+
### Usage Examples
|
124
|
+
|
125
|
+
**Dataset Statistics:**
|
126
|
+
```ruby
|
127
|
+
client = UrlCategorise::Client.new(dataset_config: { kaggle: {} })
|
128
|
+
client.load_kaggle_dataset('owner', 'dataset')
|
129
|
+
|
130
|
+
puts "Total hosts: #{client.count_of_hosts}" # All sources
|
131
|
+
puts "Dataset hosts: #{client.count_of_dataset_hosts}" # Datasets only
|
132
|
+
puts "DNS list hosts: #{client.count_of_hosts - client.count_of_dataset_hosts}"
|
133
|
+
```
|
134
|
+
|
135
|
+
**IAB Compliance:**
|
136
|
+
```ruby
|
137
|
+
# Production environment with IAB compliance
|
138
|
+
client = UrlCategorise::Client.new(
|
139
|
+
iab_compliance: true,
|
140
|
+
iab_version: :v3,
|
141
|
+
dataset_config: { kaggle: { username: 'user', api_key: 'key' } }
|
142
|
+
)
|
143
|
+
|
144
|
+
# All methods return IAB codes
|
145
|
+
domain_cats = client.categorise("example.com") # => ["3", "626"]
|
146
|
+
ip_cats = client.categorise_ip("192.168.1.100") # => ["626"]
|
147
|
+
resolved_cats = client.resolve_and_categorise("site.com") # => ["3"]
|
148
|
+
|
149
|
+
# Check compliance
|
150
|
+
puts "IAB compliant: #{client.iab_compliant?}" # => true
|
151
|
+
puts "Using version: #{client.iab_version}" # => :v3
|
152
|
+
```
|
153
|
+
|
154
|
+
**Rails Service Integration:**
|
155
|
+
```ruby
|
156
|
+
class UrlCategorizerService
|
157
|
+
def initialize
|
158
|
+
@client = UrlCategorise::ActiveRecordClient.new(
|
159
|
+
iab_compliance: Rails.env.production?,
|
160
|
+
iab_version: :v3,
|
161
|
+
dataset_config: {
|
162
|
+
kaggle: {
|
163
|
+
username: ENV['KAGGLE_USERNAME'],
|
164
|
+
api_key: ENV['KAGGLE_API_KEY']
|
165
|
+
}
|
166
|
+
}
|
167
|
+
)
|
168
|
+
end
|
169
|
+
|
170
|
+
def stats
|
171
|
+
{
|
172
|
+
total_hosts: @client.count_of_hosts,
|
173
|
+
dataset_hosts: @client.count_of_dataset_hosts,
|
174
|
+
dns_list_hosts: @client.count_of_hosts - @client.count_of_dataset_hosts,
|
175
|
+
iab_compliant: @client.iab_compliant?,
|
176
|
+
iab_version: @client.iab_version
|
177
|
+
}
|
178
|
+
end
|
179
|
+
end
|
180
|
+
```
|
181
|
+
|
182
|
+
### Migration Guide
|
183
|
+
|
184
|
+
**From v0.1.3 to v0.1.4:**
|
185
|
+
|
186
|
+
1. **No Breaking Changes** - All existing code continues to work
|
187
|
+
2. **Optional New Features** - IAB compliance and dataset helpers are opt-in
|
188
|
+
3. **Enhanced Statistics** - Use new helper methods for better insights
|
189
|
+
|
190
|
+
**Recommended Updates:**
|
191
|
+
```ruby
|
192
|
+
# Before (still works)
|
193
|
+
client = UrlCategorise::Client.new
|
194
|
+
puts "Total: #{client.count_of_hosts}"
|
195
|
+
|
196
|
+
# After (enhanced)
|
197
|
+
client = UrlCategorise::Client.new(
|
198
|
+
iab_compliance: true, # Optional IAB compliance
|
199
|
+
iab_version: :v3 # Optional version selection
|
200
|
+
)
|
201
|
+
puts "Total hosts: #{client.count_of_hosts}"
|
202
|
+
puts "Dataset hosts: #{client.count_of_dataset_hosts}"
|
203
|
+
puts "IAB compliant: #{client.iab_compliant?}"
|
204
|
+
```
|
205
|
+
|
206
|
+
### Quality Assurance
|
207
|
+
|
208
|
+
- ✅ All 316 tests pass with 0 failures/errors
|
209
|
+
- ✅ 94.69% line coverage maintained
|
210
|
+
- ✅ Code style enforced with rubocop
|
211
|
+
- ✅ Comprehensive edge case testing
|
212
|
+
- ✅ Memory-efficient implementation
|
213
|
+
- ✅ Backward compatibility preserved
|
214
|
+
|
215
|
+
This release adds powerful new features while maintaining the reliability and performance standards established in previous versions.
|
@@ -0,0 +1,353 @@
|
|
1
|
+
# Video URL Detection Feature
|
2
|
+
|
3
|
+
## Overview
|
4
|
+
|
5
|
+
The UrlCategorise gem now includes advanced video URL detection capabilities that allow you to determine if a URL is a direct link to video content, rather than just a homepage, profile page, or other non-video resource on a video hosting domain.
|
6
|
+
|
7
|
+
## Key Features
|
8
|
+
|
9
|
+
### Direct Video URL Detection
|
10
|
+
|
11
|
+
New `video_url?` method provides precise detection of video content URLs:
|
12
|
+
|
13
|
+
```ruby
|
14
|
+
client = UrlCategorise::Client.new(regex_categorization: true)
|
15
|
+
|
16
|
+
# Direct video content URLs return true
|
17
|
+
client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
|
18
|
+
client.video_url?("https://vimeo.com/123456789") # => true
|
19
|
+
client.video_url?("https://tiktok.com/@user/video/123") # => true
|
20
|
+
client.video_url?("https://dailymotion.com/video/x7abc123") # => true
|
21
|
+
|
22
|
+
# Non-video URLs return false
|
23
|
+
client.video_url?("https://youtube.com") # => false
|
24
|
+
client.video_url?("https://youtube.com/@channel") # => false
|
25
|
+
client.video_url?("https://google.com/search?q=cats") # => false
|
26
|
+
```
|
27
|
+
|
28
|
+
### Comprehensive Video Hosting Lists
|
29
|
+
|
30
|
+
- **3,500+ video hosting domains** extracted from yt-dlp (a youtube-dl fork) extractors
|
31
|
+
- **50+ regex patterns** for identifying video content URLs
|
32
|
+
- **Automatic generation** using `bin/generate_video_lists` script
|
33
|
+
- **Remote list fetching** from the GitHub repository
|
34
|
+
|
35
|
+
### Remote Pattern Files
|
36
|
+
|
37
|
+
Video domain lists and URL patterns are fetched automatically from the remote GitHub repository:
|
38
|
+
|
39
|
+
- **Video domains**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
|
40
|
+
- **URL patterns**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
|
41
|
+
|
42
|
+
### Enhanced Categorization
|
43
|
+
|
44
|
+
Regex categorization provides more specific categorization for video URLs:
|
45
|
+
|
46
|
+
```ruby
|
47
|
+
client = UrlCategorise::Client.new(regex_categorization: true)
|
48
|
+
|
49
|
+
# Basic domain categorization
|
50
|
+
client.categorise('https://youtube.com') # => [:video_hosting]
|
51
|
+
|
52
|
+
# Enhanced content detection for actual video URLs
|
53
|
+
client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
|
54
|
+
```
|
55
|
+
|
56
|
+
## Technical Implementation
|
57
|
+
|
58
|
+
### Constants Updates
|
59
|
+
|
60
|
+
New constants in `UrlCategorise::Constants`:
|
61
|
+
|
62
|
+
```ruby
|
63
|
+
# Remote video URL patterns file
|
64
|
+
VIDEO_URL_PATTERNS_FILE = 'https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt'.freeze
|
65
|
+
|
66
|
+
# Updated video hosting category to use remote list
|
67
|
+
video_hosting: ['https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts']
|
68
|
+
```
|
69
|
+
|
70
|
+
### Client Enhancements
|
71
|
+
|
72
|
+
New `Client` class features:
|
73
|
+
|
74
|
+
```ruby
|
75
|
+
# Default regex patterns file now uses remote URL
|
76
|
+
attribute :regex_patterns_file, default: -> { VIDEO_URL_PATTERNS_FILE }
|
77
|
+
|
78
|
+
# New video URL detection method
|
79
|
+
def video_url?(url)
|
80
|
+
# 1. Check if URL is valid and not empty
|
81
|
+
# 2. Ensure regex categorization is enabled
|
82
|
+
# 3. Verify URL is from a video hosting domain
|
83
|
+
# 4. Check if URL matches video content patterns
|
84
|
+
# 5. Return true only if all conditions are met
|
85
|
+
end
|
86
|
+
|
87
|
+
# Enhanced pattern fetching supports remote URLs
|
88
|
+
def fetch_regex_patterns_content
|
89
|
+
# Supports HTTP/HTTPS, file://, and direct file paths
|
90
|
+
# Graceful error handling for network failures
|
91
|
+
end
|
92
|
+
```
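
The five steps outlined in the `video_url?` comments can be sketched in plain Ruby as below. This is an illustration under stated assumptions, not the gem's implementation: `VideoUrlSketch`, its constructor arguments, the `www.` stripping, and the two sample patterns are all hypothetical.

```ruby
require 'set'
require 'uri'

# Hypothetical sketch of the five-step check (not the gem's Client).
class VideoUrlSketch
  def initialize(video_domains:, patterns:, regex_categorization: true)
    @video_domains = Set.new(video_domains)
    @patterns = patterns
    @regex_categorization = regex_categorization
  end

  def video_url?(url)
    # 1. URL must be present and non-empty
    return false if url.nil? || url.strip.empty?
    # 2. Regex categorization must be enabled
    return false unless @regex_categorization

    # 3. Host must be a known video hosting domain (assumed exact match)
    host = URI.parse(url).host
    return false if host.nil?
    return false unless @video_domains.include?(host.downcase.sub(/\Awww\./, ''))

    # 4 and 5. True only if a content pattern also matches
    @patterns.any? { |pattern| pattern.match?(url) }
  rescue URI::InvalidURIError
    false # malformed input is treated as "not a video URL"
  end
end

sketch = VideoUrlSketch.new(
  video_domains: %w[youtube.com vimeo.com],
  patterns: [%r{youtube\.com/watch\?v=[\w-]+}, %r{vimeo\.com/\d+}]
)
puts sketch.video_url?('https://youtube.com/watch?v=dQw4w9WgXcQ')
puts sketch.video_url?('https://youtube.com/@channel')
```

Checking the domain before the patterns keeps the cheap Set lookup ahead of the regex scan, so URLs on unrelated hosts are rejected without running any pattern.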
|
93
|
+
|
94
|
+
### Pattern Generation
|
95
|
+
|
96
|
+
The `bin/generate_video_lists` script has been enhanced:
|
97
|
+
|
98
|
+
- **Fixed terminal output** - Clean progress display without character overlap
|
99
|
+
- **Suppressed regex warnings** - No more nested operator warnings
|
100
|
+
- **Improved domain extraction** - More comprehensive pattern matching
|
101
|
+
- **Manual high-priority patterns** - Curated YouTube, Vimeo, TikTok patterns
|
102
|
+
- **3,500+ domains extracted** - Up from 96 domains previously
|
103
|
+
|
104
|
+
## Usage Examples
|
105
|
+
|
106
|
+
### Basic Video URL Detection
|
107
|
+
|
108
|
+
```ruby
|
109
|
+
require 'url_categorise'
|
110
|
+
|
111
|
+
# Initialize client with regex categorization
|
112
|
+
client = UrlCategorise::Client.new(regex_categorization: true)
|
113
|
+
|
114
|
+
# Test various video URLs
|
115
|
+
urls = [
|
116
|
+
'https://youtube.com/watch?v=dQw4w9WgXcQ', # YouTube video
|
117
|
+
'https://youtube.com', # YouTube homepage
|
118
|
+
'https://youtube.com/@pewdiepie', # YouTube channel
|
119
|
+
'https://vimeo.com/123456789', # Vimeo video
|
120
|
+
'https://tiktok.com/@user/video/123', # TikTok video
|
121
|
+
'https://google.com/search?q=cats' # Non-video site
|
122
|
+
]
|
123
|
+
|
124
|
+
urls.each do |url|
|
125
|
+
is_video = client.video_url?(url)
|
126
|
+
categories = client.categorise(url)
|
127
|
+
puts "#{url} -> video: #{is_video}, categories: #{categories}"
|
128
|
+
end
|
129
|
+
```
|
130
|
+
|
131
|
+
### Content Filtering Application
|
132
|
+
|
133
|
+
```ruby
|
134
|
+
class ContentFilter
|
135
|
+
def initialize
|
136
|
+
@client = UrlCategorise::Client.new(regex_categorization: true)
|
137
|
+
end
|
138
|
+
|
139
|
+
def filter_video_content(urls)
|
140
|
+
results = {
|
141
|
+
video_content: [],
|
142
|
+
video_sites: [],
|
143
|
+
other: []
|
144
|
+
}
|
145
|
+
|
146
|
+
urls.each do |url|
|
147
|
+
if @client.video_url?(url)
|
148
|
+
results[:video_content] << url
|
149
|
+
elsif @client.categorise(url).include?(:video_hosting)
|
150
|
+
results[:video_sites] << url
|
151
|
+
else
|
152
|
+
results[:other] << url
|
153
|
+
end
|
154
|
+
end
|
155
|
+
|
156
|
+
results
|
157
|
+
end
|
158
|
+
end
|
159
|
+
|
160
|
+
# Usage
|
161
|
+
filter = ContentFilter.new
|
162
|
+
results = filter.filter_video_content([
|
163
|
+
'https://youtube.com/watch?v=abc', # -> :video_content
|
164
|
+
'https://youtube.com/@channel', # -> :video_sites
|
165
|
+
'https://example.com' # -> :other
|
166
|
+
])
|
167
|
+
```
|
168
|
+
|
169
|
+
### Rails Integration
|
170
|
+
|
171
|
+
```ruby
|
172
|
+
# app/services/video_url_service.rb
|
173
|
+
class VideoUrlService
|
174
|
+
include Singleton
|
175
|
+
|
176
|
+
def initialize
|
177
|
+
@client = UrlCategorise::Client.new(
|
178
|
+
regex_categorization: true,
|
179
|
+
cache_dir: Rails.root.join('tmp', 'url_cache')
|
180
|
+
)
|
181
|
+
end
|
182
|
+
|
183
|
+
def video_content?(url)
|
184
|
+
Rails.cache.fetch("video_content_#{Digest::MD5.hexdigest(url)}", expires_in: 1.hour) do
|
185
|
+
@client.video_url?(url)
|
186
|
+
end
|
187
|
+
end
|
188
|
+
|
189
|
+
def categorize_video_url(url)
|
190
|
+
categories = @client.categorise(url)
|
191
|
+
is_video_content = @client.video_url?(url)
|
192
|
+
|
193
|
+
{
|
194
|
+
url: url,
|
195
|
+
categories: categories,
|
196
|
+
is_video_content: is_video_content,
|
197
|
+
risk_level: calculate_risk_level(categories)
|
198
|
+
}
|
199
|
+
end
|
200
|
+
|
201
|
+
private
|
202
|
+
|
203
|
+
def calculate_risk_level(categories)
|
204
|
+
return 'safe' if categories.empty?
|
205
|
+
return 'blocked' if (categories & [:malware, :phishing]).any?
|
206
|
+
return 'restricted' if (categories & [:pornography, :gambling]).any?
|
207
|
+
'allowed'
|
208
|
+
end
|
209
|
+
end
|
210
|
+
|
211
|
+
# Usage in controllers
|
212
|
+
class VideosController < ApplicationController
|
213
|
+
def check
|
214
|
+
url = params[:url]
|
215
|
+
result = VideoUrlService.instance.categorize_video_url(url)
|
216
|
+
render json: result
|
217
|
+
end
|
218
|
+
end
|
219
|
+
```
|
220
|
+
|
221
|
+
## Testing
|
222
|
+
|
223
|
+
Comprehensive test suite with 15 new tests covering:
|
224
|
+
|
225
|
+
### Core Functionality Tests
|
226
|
+
|
227
|
+
- Video URL detection when regex categorization is disabled/enabled
|
228
|
+
- Detection of video vs non-video domains
|
229
|
+
- Video content URLs vs homepage/profile URLs
|
230
|
+
- Multiple video hosting categories (`video`, `video_hosting`, etc.)
|
231
|
+
- Graceful handling of invalid URLs
|
232
|
+
|
233
|
+
### Network and Error Handling Tests
|
234
|
+
|
235
|
+
- Remote pattern file fetching with mocked HTTP requests
|
236
|
+
- Graceful failure when remote files are not accessible
|
237
|
+
- Timeout and network error handling
|
238
|
+
- Invalid regex pattern handling
|
239
|
+
|
240
|
+
### Test Results
|
241
|
+
|
242
|
+
- ✅ **383 tests, 2,729 assertions, 0 failures, 0 errors**
|
243
|
+
- ✅ **92.88% line coverage** (926/997 lines)
|
244
|
+
- ✅ All video URL detection functionality fully tested
|
245
|
+
|
246
|
+
## Performance Considerations
|
247
|
+
|
248
|
+
### Caching
|
249
|
+
|
250
|
+
- Pattern files are cached locally when `cache_dir` is specified
|
251
|
+
- Remote patterns are only fetched once per session
|
252
|
+
- Client-side caching recommended for high-traffic applications
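
For the client-side caching recommendation, one possible shape is a small in-process memoizer with a time-to-live. `MemoizedVideoCheck` is a hypothetical helper, not part of the gem; the block passed to it stands in for a call such as `client.video_url?(url)`.

```ruby
# Hypothetical in-memory TTL cache around any video check (not part of the gem).
class MemoizedVideoCheck
  def initialize(ttl: 3600, &checker)
    @ttl = ttl          # seconds an entry stays valid
    @checker = checker  # block doing the real (expensive) check
    @entries = {}
  end

  # `now` is injectable so expiry can be exercised without sleeping.
  def video_url?(url, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    entry = @entries[url]
    return entry[:value] if entry && now - entry[:at] < @ttl

    value = @checker.call(url)
    @entries[url] = { value: value, at: now }
    value
  end
end
```

The sketch is not thread-safe and never evicts entries, so a shared store such as `Rails.cache` (as in the Rails example in this document) is likely the better fit for multi-process deployments.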
|
253
|
+
|
254
|
+
### Memory Usage
|
255
|
+
|
256
|
+
- Regex patterns are compiled once during client initialization
|
257
|
+
- Minimal memory overhead for video URL detection
|
258
|
+
- Efficient Set-based domain lookups
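
A rough sketch of the "compile once, look up in a Set" approach follows. The file formats are assumptions: hosts-style lines (`0.0.0.0 domain`) for domains, and one regex per line with `#` comment lines for patterns. `parse_hosts` and `compile_patterns` are hypothetical helper names.

```ruby
require 'set'

# Builds a Set of domains from hosts-file content (assumed "IP domain" lines).
def parse_hosts(content)
  content.each_line.with_object(Set.new) do |line, domains|
    entry = line.sub(/#.*/, '').strip # drop comments and whitespace
    next if entry.empty?

    domains << entry.split(/\s+/).last.downcase
  end
end

# Compiles each pattern line once, skipping blanks, comments, and invalid regexes.
def compile_patterns(content)
  content.each_line.filter_map do |line|
    source = line.strip
    next if source.empty? || source.start_with?('#')

    begin
      Regexp.new(source)
    rescue RegexpError
      nil # an invalid pattern is ignored rather than aborting the load
    end
  end
end
```

Membership tests against the resulting Set are hash lookups, so their cost does not grow with the number of domains, and the compiled `Regexp` objects can be reused for every URL checked.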
|
259
|
+
|
260
|
+
### Network Optimization
|
261
|
+
|
262
|
+
- Remote pattern files are small (~50KB total)
|
263
|
+
- Graceful fallback to local files if remote fetch fails
|
264
|
+
- HTTP timeout configuration supported
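
The fetch-with-fallback behaviour could be approximated as shown here. This is a sketch under assumptions, not the gem's `fetch_regex_patterns_content`: the function names, the `timeout:` keyword, and the choice to return an empty string on failure are all illustrative.

```ruby
require 'net/http'
require 'uri'

# Reads pattern content from an HTTP(S) URL, a file:// URL, or a plain path.
# Returns '' on any failure so callers can fall back gracefully.
def fetch_patterns_content(source, timeout: 10)
  case source
  when %r{\Ahttps?://}
    uri = URI.parse(source)
    response = Net::HTTP.start(uri.host, uri.port,
                               use_ssl: uri.scheme == 'https',
                               open_timeout: timeout,
                               read_timeout: timeout) do |http|
      http.get(uri.request_uri)
    end
    response.is_a?(Net::HTTPSuccess) ? response.body : ''
  when %r{\Afile://}
    File.read(source.sub(%r{\Afile://}, ''))
  else
    File.read(source)
  end
rescue StandardError
  '' # network errors, timeouts, and missing files all degrade to "no content"
end

# Tries the primary source first, then the local fallback if nothing came back.
def fetch_with_fallback(primary, fallback, timeout: 10)
  content = fetch_patterns_content(primary, timeout: timeout)
  content.empty? ? fetch_patterns_content(fallback, timeout: timeout) : content
end
```

Returning an empty string instead of raising keeps a failed download from breaking client construction; the trade-off is that callers cannot tell "empty file" from "fetch failed" without extra logging.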
|
265
|
+
|
266
|
+
## Migration Guide
|
267
|
+
|
268
|
+
### From Previous Versions
|
269
|
+
|
270
|
+
No breaking changes - video URL detection is entirely optional:
|
271
|
+
|
272
|
+
```ruby
|
273
|
+
# Existing code continues to work unchanged
|
274
|
+
client = UrlCategorise::Client.new
|
275
|
+
categories = client.categorise('youtube.com')
|
276
|
+
|
277
|
+
# Enable video URL detection when needed
|
278
|
+
client_with_video = UrlCategorise::Client.new(regex_categorization: true)
|
279
|
+
is_video = client_with_video.video_url?('https://youtube.com/watch?v=abc')
|
280
|
+
```
|
281
|
+
|
282
|
+
### Enabling Video Detection
|
283
|
+
|
284
|
+
To use video URL detection in existing applications:
|
285
|
+
|
286
|
+
1. **Enable regex categorization**: `regex_categorization: true`
|
287
|
+
2. **Use the new method**: `client.video_url?(url)`
|
288
|
+
3. **Optional configuration**: Custom pattern files, caching
|
289
|
+
|
290
|
+
### Configuration Options
|
291
|
+
|
292
|
+
```ruby
|
293
|
+
# Default configuration (recommended)
|
294
|
+
client = UrlCategorise::Client.new(regex_categorization: true)
|
295
|
+
|
296
|
+
# Custom pattern file (local)
|
297
|
+
client = UrlCategorise::Client.new(
|
298
|
+
regex_categorization: true,
|
299
|
+
regex_patterns_file: 'path/to/custom/patterns.txt'
|
300
|
+
)
|
301
|
+
|
302
|
+
# Custom pattern file (remote)
|
303
|
+
client = UrlCategorise::Client.new(
|
304
|
+
regex_categorization: true,
|
305
|
+
regex_patterns_file: 'https://example.com/patterns.txt'
|
306
|
+
)
|
307
|
+
|
308
|
+
# With caching
|
309
|
+
client = UrlCategorise::Client.new(
|
310
|
+
regex_categorization: true,
|
311
|
+
cache_dir: './cache'
|
312
|
+
)
|
313
|
+
```
|
314
|
+
|
315
|
+
## Maintenance
|
316
|
+
|
317
|
+
### Updating Video Lists
|
318
|
+
|
319
|
+
The video hosting lists are automatically maintained in the GitHub repository. To generate updated lists locally:
|
320
|
+
|
321
|
+
```bash
|
322
|
+
# Generate fresh video hosting lists
|
323
|
+
ruby bin/generate_video_lists
|
324
|
+
|
325
|
+
# This creates/updates:
|
326
|
+
# - lists/video_hosting_domains.hosts (3,500+ domains)
|
327
|
+
# - lists/video_url_patterns.txt (50+ patterns)
|
328
|
+
```
|
329
|
+
|
330
|
+
### Manual Pattern Curation
|
331
|
+
|
332
|
+
High-priority patterns are manually curated in the generation script:
|
333
|
+
|
334
|
+
- YouTube video and Shorts URLs
|
335
|
+
- Vimeo video URLs
|
336
|
+
- Dailymotion video URLs
|
337
|
+
- Twitch video URLs
|
338
|
+
- TikTok video URLs
|
339
|
+
|
340
|
+
### List Health Monitoring
|
341
|
+
|
342
|
+
Use the existing health monitoring for video hosting lists:
|
343
|
+
|
344
|
+
```ruby
|
345
|
+
client = UrlCategorise::Client.new
|
346
|
+
report = client.check_all_lists
|
347
|
+
|
348
|
+
# Check video hosting category specifically
|
349
|
+
video_status = report[:successful_lists][:video_hosting]
|
350
|
+
puts "Video hosting lists: #{video_status&.length || 0} URLs working"
|
351
|
+
```
|
352
|
+
|
353
|
+
This feature provides enterprise-grade video URL detection while maintaining the gem's focus on performance, reliability, and ease of use.
|