UrlCategorise 0.1.6 → 0.1.7
- checksums.yaml +4 -4
- data/.claude/settings.local.json +2 -1
- data/.gitignore +1 -0
- data/CLAUDE.md +71 -8
- data/Gemfile.lock +5 -1
- data/README.md +129 -11
- data/bin/export_csv +44 -7
- data/bin/generate_video_lists +373 -0
- data/docs/video-url-detection.md +353 -0
- data/lib/url_categorise/client.rb +320 -58
- data/lib/url_categorise/constants.rb +9 -6
- data/lib/url_categorise/dataset_processor.rb +18 -6
- data/lib/url_categorise/iab_compliance.rb +2 -0
- data/lib/url_categorise/version.rb +1 -1
- data/lists/video_hosting_domains.hosts +7057 -0
- data/lists/video_url_patterns.txt +297 -0
- data/url_categorise.gemspec +1 -0
- metadata +19 -1
@@ -0,0 +1,353 @@
# Video URL Detection Feature

## Overview

The UrlCategorise gem now includes video URL detection that lets you determine whether a URL links directly to video content, rather than to a homepage, profile page, or other non-video resource on a video hosting domain.

## Key Features

### 🎬 Direct Video URL Detection

The new `video_url?` method detects video content URLs:

```ruby
client = UrlCategorise::Client.new(regex_categorization: true)

# Direct video content URLs return true
client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
client.video_url?("https://vimeo.com/123456789") # => true
client.video_url?("https://tiktok.com/@user/video/123") # => true
client.video_url?("https://dailymotion.com/video/x7abc123") # => true

# Non-video URLs return false
client.video_url?("https://youtube.com") # => false
client.video_url?("https://youtube.com/@channel") # => false
client.video_url?("https://google.com/search?q=cats") # => false
```

### 📡 Comprehensive Video Hosting Lists

- **3,500+ video hosting domains** extracted from yt-dlp (a youtube-dl fork) extractors
- **50+ regex patterns** for identifying video content URLs
- **Automatic generation** using the `bin/generate_video_lists` script
- **Remote list fetching** from the GitHub repository

### 🔗 Remote Pattern Files

Video patterns are automatically fetched from the remote GitHub repository:

- **Video domains**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
- **URL patterns**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
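
To make the role of the pattern file concrete, here is a minimal sketch of turning its text into compiled regexes. The file format is an assumption (one regex source per line, with blank lines and `#` comments ignored), and `compile_patterns` is a hypothetical helper, not part of the gem's API.

```ruby
# Hypothetical helper: turns the text of a pattern file into compiled regexes.
# Assumed format (not confirmed by the gem): one regex source per line,
# with blank lines and lines starting with '#' ignored.
def compile_patterns(text)
  text.each_line.filter_map do |line|
    source = line.strip
    next if source.empty? || source.start_with?('#')

    begin
      Regexp.new(source, Regexp::IGNORECASE)
    rescue RegexpError
      nil # skip a malformed pattern instead of aborting the whole load
    end
  end
end

# Single-quoted heredoc keeps the backslashes exactly as written.
sample = <<~'PATTERNS'
  # YouTube watch pages
  youtube\.com/watch\?v=[\w-]+
  vimeo\.com/\d+
  broken(pattern
PATTERNS

patterns = compile_patterns(sample)
puts patterns.length
puts patterns.any? { |re| re.match?('https://vimeo.com/123456789') }
```

Skipping a malformed line rather than raising keeps one bad entry in a remotely maintained file from disabling detection entirely.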

### 🧠 Enhanced Categorization

With regex categorization enabled, URLs that point at video content receive an additional, more specific category:

```ruby
client = UrlCategorise::Client.new(regex_categorization: true)

# Basic domain categorization
client.categorise('https://youtube.com') # => [:video_hosting]

# Enhanced content detection for actual video URLs
client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
```

## Technical Implementation

### Constants Updates

New constants in `UrlCategorise::Constants`:

```ruby
# Remote video URL patterns file
VIDEO_URL_PATTERNS_FILE = 'https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt'.freeze

# Updated video hosting category to use remote list
video_hosting: ['https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts']
```

### Client Enhancements

New `Client` class features:

```ruby
# Default regex patterns file now uses remote URL
attribute :regex_patterns_file, default: -> { VIDEO_URL_PATTERNS_FILE }

# New video URL detection method
def video_url?(url)
  # 1. Check if URL is valid and not empty
  # 2. Ensure regex categorization is enabled
  # 3. Verify URL is from a video hosting domain
  # 4. Check if URL matches video content patterns
  # 5. Return true only if all conditions are met
end

# Enhanced pattern fetching supports remote URLs
def fetch_regex_patterns_content
  # Supports HTTP/HTTPS, file://, and direct file paths
  # Graceful error handling for network failures
end
```
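
The five commented steps in `video_url?` can be sketched as a standalone function. This is an illustration under stated assumptions, not the gem's implementation: `VIDEO_DOMAINS`, `VIDEO_PATTERNS`, and `video_url_sketch?` are hypothetical names, and the two-entry lists stand in for the real remote lists.

```ruby
require 'set'
require 'uri'

# Illustrative stand-ins for the gem's loaded lists (hypothetical data).
VIDEO_DOMAINS = Set['youtube.com', 'vimeo.com'].freeze
VIDEO_PATTERNS = [
  %r{youtube\.com/watch\?v=[\w-]+}i,
  %r{vimeo\.com/\d+}i
].freeze

# Sketch of the five checks; `regex_enabled` stands in for the
# client's regex_categorization setting.
def video_url_sketch?(url, regex_enabled: true)
  # 1. Reject nil or blank input.
  return false if url.nil? || url.strip.empty?
  # 2. Detection only runs when regex categorization is on.
  return false unless regex_enabled

  host = URI.parse(url.strip).host
  return false if host.nil?

  # 3. The host must be a known video hosting domain.
  host = host.downcase.delete_prefix('www.')
  return false unless VIDEO_DOMAINS.include?(host)

  # 4 and 5. True only if a content pattern also matches.
  VIDEO_PATTERNS.any? { |re| re.match?(url) }
rescue URI::InvalidURIError
  false
end

puts video_url_sketch?('https://youtube.com/watch?v=dQw4w9WgXcQ')
puts video_url_sketch?('https://youtube.com')
```

Checking the domain before the patterns means most non-video URLs are rejected by a single set lookup, and the regexes only run for hosts already known to serve video.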

### Pattern Generation

The `bin/generate_video_lists` script has been enhanced:

- **Fixed terminal output** - Clean progress display without character overlap
- **Suppressed regex warnings** - No more nested operator warnings
- **Improved domain extraction** - More comprehensive pattern matching
- **Manual high-priority patterns** - Curated YouTube, Vimeo, TikTok patterns
- **3,500+ domains extracted** - Up from 96 domains previously
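
As a rough illustration of what domain extraction from extractor patterns involves, the sketch below pulls literal domain names out of a regex source string. It is a heuristic written for this note, not the logic of `bin/generate_video_lists`; `DOMAIN_IN_REGEX` and `domains_from_pattern` are hypothetical names, and the input strings imitate the style of yt-dlp URL patterns.

```ruby
# Matches runs such as `player\.vimeo\.com` inside a regex source string:
# one or more labels, each followed by an escaped dot, then a TLD.
DOMAIN_IN_REGEX = /((?:[a-z0-9-]+\\\.)+[a-z]{2,})/i

# Hypothetical helper: returns the literal domains found in a regex source.
# Heuristic only; it can pick up things that merely look like domains.
def domains_from_pattern(source)
  source.scan(DOMAIN_IN_REGEX)
        .flatten
        .map { |domain| domain.gsub('\\.', '.').downcase }
        .uniq
end

# The pattern is handled as plain text, so it is never compiled by Ruby.
puts domains_from_pattern('https?://(?:www\.)?vimeo\.com/(?P<id>\d+)').inspect
```

Optional prefixes such as `(?:www\.)?` are skipped because the escaped dot there is followed by a parenthesis rather than another label.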

## Usage Examples

### Basic Video URL Detection

```ruby
require 'url_categorise'

# Initialize client with regex categorization
client = UrlCategorise::Client.new(regex_categorization: true)

# Test various video URLs
urls = [
  'https://youtube.com/watch?v=dQw4w9WgXcQ', # YouTube video
  'https://youtube.com', # YouTube homepage
  'https://youtube.com/@pewdiepie', # YouTube channel
  'https://vimeo.com/123456789', # Vimeo video
  'https://tiktok.com/@user/video/123', # TikTok video
  'https://google.com/search?q=cats' # Non-video site
]

urls.each do |url|
  is_video = client.video_url?(url)
  categories = client.categorise(url)
  puts "#{url} -> video: #{is_video}, categories: #{categories}"
end
```

### Content Filtering Application

```ruby
class ContentFilter
  def initialize
    @client = UrlCategorise::Client.new(regex_categorization: true)
  end

  def filter_video_content(urls)
    results = {
      video_content: [],
      video_sites: [],
      other: []
    }

    urls.each do |url|
      if @client.video_url?(url)
        results[:video_content] << url
      elsif @client.categorise(url).include?(:video_hosting)
        results[:video_sites] << url
      else
        results[:other] << url
      end
    end

    results
  end
end

# Usage
filter = ContentFilter.new
results = filter.filter_video_content([
  'https://youtube.com/watch?v=abc', # -> :video_content
  'https://youtube.com/@channel', # -> :video_sites
  'https://example.com' # -> :other
])
```

### Rails Integration

```ruby
# app/services/video_url_service.rb
require 'digest'    # Digest::MD5 for cache keys
require 'singleton' # Singleton mixin

class VideoUrlService
  include Singleton

  def initialize
    @client = UrlCategorise::Client.new(
      regex_categorization: true,
      cache_dir: Rails.root.join('tmp', 'url_cache')
    )
  end

  # Caches the detection result per URL for one hour
  def video_content?(url)
    Rails.cache.fetch("video_content_#{Digest::MD5.hexdigest(url)}", expires_in: 1.hour) do
      @client.video_url?(url)
    end
  end

  def categorize_video_url(url)
    categories = @client.categorise(url)
    is_video_content = @client.video_url?(url)

    {
      url: url,
      categories: categories,
      is_video_content: is_video_content,
      risk_level: calculate_risk_level(categories)
    }
  end

  private

  def calculate_risk_level(categories)
    return 'safe' if categories.empty?
    return 'blocked' if (categories & [:malware, :phishing]).any?
    return 'restricted' if (categories & [:pornography, :gambling]).any?
    'allowed'
  end
end

# Usage in controllers
class VideosController < ApplicationController
  def check
    url = params[:url]
    result = VideoUrlService.instance.categorize_video_url(url)
    render json: result
  end
end
```

## Testing

The test suite gains 15 new tests covering:

### Core Functionality Tests

- Video URL detection when regex categorization is disabled/enabled
- Detection of video vs non-video domains
- Video content URLs vs homepage/profile URLs
- Multiple video hosting categories (`video`, `video_hosting`, etc.)
- Graceful handling of invalid URLs

### Network and Error Handling Tests

- Remote pattern file fetching with mocked HTTP requests
- Graceful failure when remote files are not accessible
- Timeout and network error handling
- Invalid regex pattern handling

### Test Results

- ✅ **383 tests, 2,729 assertions, 0 failures, 0 errors**
- ✅ **92.88% line coverage** (926/997 lines)
- ✅ All video URL detection functionality fully tested

## Performance Considerations

### Caching

- Pattern files are cached locally when `cache_dir` is specified
- Remote patterns are only fetched once per session
- Client-side caching recommended for high-traffic applications
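
The idea behind a `cache_dir` can be sketched as a small file cache keyed by a digest of the source URL. This is an illustration only and makes no claim about the gem's actual cache layout or expiry rules; `cached_list` and its `max_age` parameter are hypothetical, and the block passed to it stands in for the real network fetch.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical cache wrapper: stores fetched list text under cache_dir,
# keyed by a digest of the source URL. The block performs the real fetch
# and is only called when no fresh cached copy exists.
def cached_list(url, cache_dir:, max_age: 24 * 60 * 60)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, "#{Digest::SHA256.hexdigest(url)}.txt")

  fresh = File.exist?(path) && (Time.now - File.mtime(path)) < max_age
  return File.read(path) if fresh

  content = yield(url)
  File.write(path, content)
  content
end

# Demo: the second call is served from disk, so the block runs once.
fetch_count = 0
Dir.mktmpdir do |dir|
  2.times do
    cached_list('https://example.com/patterns.txt', cache_dir: dir) do |_url|
      fetch_count += 1
      "vimeo\\.com/\\d+\n"
    end
  end
end
puts fetch_count
```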

### Memory Usage

- Regex patterns are compiled once during client initialization
- Minimal memory overhead for video URL detection
- Efficient Set-based domain lookups
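
A minimal sketch of both points, compiling patterns once and looking hosts up in a `Set`, is shown here. The names `HOSTS`, `CONTENT_RE`, and `listed_host?` are hypothetical and the data is illustrative; the gem's internal structures may differ.

```ruby
require 'set'

# Hypothetical data; the real lists are far larger.
HOSTS = Set['youtube.com', 'vimeo.com'].freeze

# Built once and reused for every URL instead of recompiling per call.
CONTENT_RE = Regexp.union(
  %r{youtube\.com/watch\?v=}i,
  %r{vimeo\.com/\d+}i
)

# Walks from the full host name up through its parent domains,
# doing one constant-time Set lookup per level.
def listed_host?(host, hosts = HOSTS)
  labels = host.downcase.split('.')
  labels.each_index.any? { |i| hosts.include?(labels[i..].join('.')) }
end

puts listed_host?('m.youtube.com')
puts listed_host?('notyoutube.com')
puts CONTENT_RE.match?('https://vimeo.com/123456789')
```

Comparing whole labels avoids the false positive a plain substring check would give for a host like `notyoutube.com`.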

### Network Optimization

- Remote pattern files are small (~50KB total)
- Graceful fallback to local files if remote fetch fails
- HTTP timeout configuration supported
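
One way to combine the source types, timeout, and fallback described in this document is sketched below. It is an assumption-laden illustration, not the gem's `fetch_regex_patterns_content`: `read_list_source` and its `timeout` and `fallback` parameters are hypothetical, and only standard-library calls are used.

```ruby
require 'net/http'
require 'uri'

# Hypothetical loader: reads list text from an http(s) URL, a file:// URL,
# or a plain path, and returns the fallback text instead of raising.
def read_list_source(source, timeout: 10, fallback: '')
  case source
  when %r{\Ahttps?://}i
    uri = URI.parse(source)
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.is_a?(URI::HTTPS),
                    open_timeout: timeout,
                    read_timeout: timeout) do |http|
      response = http.get(uri.request_uri)
      response.is_a?(Net::HTTPSuccess) ? response.body : fallback
    end
  when %r{\Afile://}i
    File.read(URI.parse(source).path)
  else
    File.read(source)
  end
rescue StandardError
  # Covers timeouts, DNS and socket errors, missing files, and bad URLs.
  fallback
end
```

A caller could pass the contents of a bundled local list as `fallback`, so that a failed remote fetch degrades to slightly older data rather than to no patterns at all.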

## Migration Guide

### From Previous Versions

No breaking changes - video URL detection is entirely optional:

```ruby
# Existing code continues to work unchanged
client = UrlCategorise::Client.new
categories = client.categorise('youtube.com')

# Enable video URL detection when needed
client_with_video = UrlCategorise::Client.new(regex_categorization: true)
is_video = client_with_video.video_url?('https://youtube.com/watch?v=abc')
```

### Enabling Video Detection

To use video URL detection in existing applications:

1. **Enable regex categorization**: `regex_categorization: true`
2. **Use the new method**: `client.video_url?(url)`
3. **Optional configuration**: Custom pattern files, caching

### Configuration Options

```ruby
# Default configuration (recommended)
client = UrlCategorise::Client.new(regex_categorization: true)

# Custom pattern file (local)
client = UrlCategorise::Client.new(
  regex_categorization: true,
  regex_patterns_file: 'path/to/custom/patterns.txt'
)

# Custom pattern file (remote)
client = UrlCategorise::Client.new(
  regex_categorization: true,
  regex_patterns_file: 'https://example.com/patterns.txt'
)

# With caching
client = UrlCategorise::Client.new(
  regex_categorization: true,
  cache_dir: './cache'
)
```

## Maintenance

### Updating Video Lists

The video hosting lists are automatically maintained in the GitHub repository. To generate updated lists locally:

```bash
# Generate fresh video hosting lists
ruby bin/generate_video_lists

# This creates/updates:
# - lists/video_hosting_domains.hosts (3,500+ domains)
# - lists/video_url_patterns.txt (50+ patterns)
```
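
To sanity-check a freshly generated `.hosts` file, a short parser such as the one below may help. The line format is an assumption (`0.0.0.0 example.com` style entries or bare domains, with `#` starting a comment), and `parse_hosts` is a hypothetical helper rather than part of the gem.

```ruby
require 'set'

# Hypothetical parser: assumes hosts-file lines such as
# "0.0.0.0 example.com" or a bare domain, with '#' starting a comment.
def parse_hosts(text)
  text.each_line.each_with_object(Set.new) do |line, domains|
    entry = line.sub(/#.*/, '').strip
    next if entry.empty?

    # The domain is the last whitespace-separated field on the line.
    domains << entry.split(/\s+/).last.downcase
  end
end

sample = <<~HOSTS
  # Video hosting domains
  0.0.0.0 youtube.com
  0.0.0.0 Vimeo.com   # trailing comment
  dailymotion.com
HOSTS

puts parse_hosts(sample).to_a.inspect
```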

### Manual Pattern Curation

High-priority patterns are manually curated in the generation script:

- YouTube video and Shorts URLs
- Vimeo video URLs
- Dailymotion video URLs
- Twitch video URLs
- TikTok video URLs

### List Health Monitoring

Use the existing health monitoring for video hosting lists:

```ruby
client = UrlCategorise::Client.new
report = client.check_all_lists

# Check video hosting category specifically
video_status = report[:successful_lists][:video_hosting]
puts "Video hosting lists: #{video_status&.length || 0} URLs working"
```

This feature provides enterprise-grade video URL detection while maintaining the gem's focus on performance, reliability, and ease of use.