UrlCategorise 0.1.6 → 0.1.7

@@ -0,0 +1,353 @@
1
+ # Video URL Detection Feature
2
+
3
+ ## Overview
4
+
5
+ The UrlCategorise gem can now determine whether a URL links directly to video content rather than to a homepage, profile page, or other non-video resource on a video hosting domain.
6
+
7
+ ## Key Features
8
+
9
+ ### 🎬 Direct Video URL Detection
10
+
11
+ The new `video_url?` method reports whether a URL points to video content:
12
+
13
+ ```ruby
14
+ client = UrlCategorise::Client.new(regex_categorization: true)
15
+
16
+ # Direct video content URLs return true
17
+ client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
18
+ client.video_url?("https://vimeo.com/123456789") # => true
19
+ client.video_url?("https://tiktok.com/@user/video/123") # => true
20
+ client.video_url?("https://dailymotion.com/video/x7abc123") # => true
21
+
22
+ # Non-video URLs return false
23
+ client.video_url?("https://youtube.com") # => false
24
+ client.video_url?("https://youtube.com/@channel") # => false
25
+ client.video_url?("https://google.com/search?q=cats") # => false
26
+ ```
27
+
28
+ ### 📡 Comprehensive Video Hosting Lists
29
+
30
+ - **3,500+ video hosting domains** extracted from the extractors in yt-dlp (a youtube-dl fork)
31
+ - **50+ regex patterns** for identifying video content URLs
32
+ - **Automatic generation** using the `bin/generate_video_lists` script
33
+ - **Remote list fetching** from the GitHub repository
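To illustrate how a hosts-style domain list can be turned into a fast lookup structure, here is a minimal sketch. The `parse_hosts` helper and the sample data are hypothetical and are not part of the gem; the real list may differ in detail.

```ruby
require 'set'

# Hypothetical helper (not part of the gem): parse hosts-style text,
# e.g. "0.0.0.0 example.com" or a bare "example.com", into a Set of domains.
def parse_hosts(text)
  domains = text.each_line.filter_map do |line|
    entry = line.sub(/#.*/, '').strip   # drop comments and surrounding whitespace
    next if entry.empty?                # skip blank and comment-only lines
    entry.split(/\s+/).last.downcase    # the last field is taken as the domain
  end
  Set.new(domains)
end

# Sample data for illustration only, not the real list.
SAMPLE_HOSTS = <<~HOSTS
  # sample data
  0.0.0.0 youtube.com
  0.0.0.0 Vimeo.com

  dailymotion.com
HOSTS

SAMPLE_DOMAIN_SET = parse_hosts(SAMPLE_HOSTS)
puts SAMPLE_DOMAIN_SET.include?('vimeo.com')   # true
puts SAMPLE_DOMAIN_SET.include?('google.com')  # false
```

A `Set` keeps membership checks at roughly constant cost regardless of how many domains the list contains.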
34
+
35
+ ### 🔗 Remote Pattern Files
36
+
37
+ Video patterns are automatically fetched from the remote GitHub repository:
38
+
39
+ - **Video domains**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
40
+ - **URL patterns**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
41
+
42
+ ### 🧠 Enhanced Categorization
43
+
44
+ With regex categorization enabled, video content URLs receive an additional, more specific category:
45
+
46
+ ```ruby
47
+ client = UrlCategorise::Client.new(regex_categorization: true)
48
+
49
+ # Basic domain categorization
50
+ client.categorise('https://youtube.com') # => [:video_hosting]
51
+
52
+ # Enhanced content detection for actual video URLs
53
+ client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
54
+ ```
55
+
56
+ ## Technical Implementation
57
+
58
+ ### Constants Updates
59
+
60
+ New constants in `UrlCategorise::Constants`:
61
+
62
+ ```ruby
63
+ # Remote video URL patterns file
64
+ VIDEO_URL_PATTERNS_FILE = 'https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt'.freeze
65
+
66
+ # Updated video hosting category to use remote list
67
+ video_hosting: ['https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts']
68
+ ```
69
+
70
+ ### Client Enhancements
71
+
72
+ New `Client` class features:
73
+
74
+ ```ruby
75
+ # Default regex patterns file now uses remote URL
76
+ attribute :regex_patterns_file, default: -> { VIDEO_URL_PATTERNS_FILE }
77
+
78
+ # New video URL detection method
79
+ def video_url?(url)
80
+   # 1. Check if URL is valid and not empty
81
+   # 2. Ensure regex categorization is enabled
82
+   # 3. Verify URL is from a video hosting domain
83
+   # 4. Check if URL matches video content patterns
84
+   # 5. Return true only if all conditions are met
85
+ end
86
+
87
+ # Enhanced pattern fetching supports remote URLs
88
+ def fetch_regex_patterns_content
89
+   # Supports HTTP/HTTPS, file://, and direct file paths
90
+   # Graceful error handling for network failures
91
+ end
92
+ ```
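The five steps listed for `video_url?` can be sketched roughly as follows. This is an illustration under stated assumptions, not the gem's actual implementation: `sketch_video_url?`, `SAMPLE_VIDEO_DOMAINS`, and `SAMPLE_VIDEO_PATTERNS` are hypothetical names, and the two patterns are simplified stand-ins for the real pattern file.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-ins for the gem's loaded domain list and pattern file.
SAMPLE_VIDEO_DOMAINS = Set['youtube.com', 'vimeo.com']
SAMPLE_VIDEO_PATTERNS = [
  %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=[\w-]+},
  %r{\Ahttps?://(?:www\.)?vimeo\.com/\d+}
].freeze

# Hypothetical sketch of the five-step check; not the gem's code.
def sketch_video_url?(url, regex_categorization: true)
  return false if url.nil? || url.strip.empty?   # 1. URL must be present
  return false unless regex_categorization       # 2. feature must be enabled

  host = URI.parse(url).host&.downcase&.delete_prefix('www.')
  return false unless host && SAMPLE_VIDEO_DOMAINS.include?(host)  # 3. known video domain

  # 4 and 5. true only if some content pattern matches
  SAMPLE_VIDEO_PATTERNS.any? { |pattern| pattern.match?(url) }
rescue URI::InvalidURIError
  false  # unparseable URLs are treated as non-video
end

puts sketch_video_url?('https://youtube.com/watch?v=dQw4w9WgXcQ')  # true
puts sketch_video_url?('https://youtube.com/@channel')             # false
```

Checking the domain before running the patterns keeps the common non-video case cheap, since most URLs fail at the `Set` lookup.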
93
+
94
+ ### Pattern Generation
95
+
96
+ The `bin/generate_video_lists` script has been enhanced:
97
+
98
+ - **Fixed terminal output** - Clean progress display without character overlap
99
+ - **Suppressed regex warnings** - No more nested operator warnings
100
+ - **Improved domain extraction** - More comprehensive pattern matching
101
+ - **Manual high-priority patterns** - Curated YouTube, Vimeo, TikTok patterns
102
+ - **3,500+ domains extracted** - Up from 96 domains previously
103
+
104
+ ## Usage Examples
105
+
106
+ ### Basic Video URL Detection
107
+
108
+ ```ruby
109
+ require 'url_categorise'
110
+
111
+ # Initialize client with regex categorization
112
+ client = UrlCategorise::Client.new(regex_categorization: true)
113
+
114
+ # Test various video URLs
115
+ urls = [
116
+   'https://youtube.com/watch?v=dQw4w9WgXcQ', # YouTube video
116
+   'https://youtube.com',                     # YouTube homepage
117
+   'https://youtube.com/@pewdiepie',          # YouTube channel
118
+   'https://vimeo.com/123456789',             # Vimeo video
119
+   'https://tiktok.com/@user/video/123',      # TikTok video
120
+   'https://google.com/search?q=cats'         # Non-video site
122
+ ]
123
+
124
+ urls.each do |url|
125
+   is_video = client.video_url?(url)
125
+   categories = client.categorise(url)
126
+   puts "#{url} -> video: #{is_video}, categories: #{categories}"
128
+ end
129
+ ```
130
+
131
+ ### Content Filtering Application
132
+
133
+ ```ruby
134
+ class ContentFilter
135
+   def initialize
135
+     @client = UrlCategorise::Client.new(regex_categorization: true)
136
+   end
137
+
138
+   def filter_video_content(urls)
139
+     results = {
140
+       video_content: [],
141
+       video_sites: [],
142
+       other: []
143
+     }
144
+
145
+     urls.each do |url|
146
+       if @client.video_url?(url)
147
+         results[:video_content] << url
148
+       elsif @client.categorise(url).include?(:video_hosting)
149
+         results[:video_sites] << url
150
+       else
151
+         results[:other] << url
152
+       end
153
+     end
154
+
155
+     results
156
+   end
158
+ end
159
+
160
+ # Usage
161
+ filter = ContentFilter.new
162
+ results = filter.filter_video_content([
163
+   'https://youtube.com/watch?v=abc', # -> :video_content
163
+   'https://youtube.com/@channel',    # -> :video_sites
164
+   'https://example.com'              # -> :other
166
+ ])
167
+ ```
168
+
169
+ ### Rails Integration
170
+
171
+ ```ruby
172
+ # app/services/video_url_service.rb
173
+ class VideoUrlService
174
+   include Singleton
174
+
175
+   def initialize
176
+     @client = UrlCategorise::Client.new(
177
+       regex_categorization: true,
178
+       cache_dir: Rails.root.join('tmp', 'url_cache')
179
+     )
180
+   end
181
+
182
+   def video_content?(url)
183
+     Rails.cache.fetch("video_content_#{Digest::MD5.hexdigest(url)}", expires_in: 1.hour) do
184
+       @client.video_url?(url)
185
+     end
186
+   end
187
+
188
+   def categorize_video_url(url)
189
+     categories = @client.categorise(url)
190
+     is_video_content = @client.video_url?(url)
191
+
192
+     {
193
+       url: url,
194
+       categories: categories,
195
+       is_video_content: is_video_content,
196
+       risk_level: calculate_risk_level(categories)
197
+     }
198
+   end
199
+
200
+   private
201
+
202
+   def calculate_risk_level(categories)
203
+     return 'safe' if categories.empty?
204
+     return 'blocked' if (categories & [:malware, :phishing]).any?
205
+     return 'restricted' if (categories & [:pornography, :gambling]).any?
206
+     'allowed'
207
+   end
209
+ end
210
+
211
+ # Usage in controllers
212
+ class VideosController < ApplicationController
213
+   def check
213
+     url = params[:url]
214
+     result = VideoUrlService.instance.categorize_video_url(url)
215
+     render json: result
216
+   end
218
+ end
219
+ ```
220
+
221
+ ## Testing
222
+
223
+ This release adds 15 new tests covering:
224
+
225
+ ### Core Functionality Tests
226
+
227
+ - Video URL detection when regex categorization is disabled/enabled
228
+ - Detection of video vs non-video domains
229
+ - Video content URLs vs homepage/profile URLs
230
+ - Multiple video hosting categories (`video`, `video_hosting`, etc.)
231
+ - Graceful handling of invalid URLs
232
+
233
+ ### Network and Error Handling Tests
234
+
235
+ - Remote pattern file fetching with mocked HTTP requests
236
+ - Graceful failure when remote files are not accessible
237
+ - Timeout and network error handling
238
+ - Invalid regex pattern handling
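One way to tolerate invalid regex patterns is to compile each line separately and skip the ones that fail. The sketch below assumes one plain regex per line with `#` marking comment lines; both the `compile_patterns` helper and that file format are assumptions, not the gem's documented behaviour.

```ruby
# Hypothetical helper (not part of the gem): compile pattern strings,
# skipping blanks, comment lines, and anything that is not a valid regex.
def compile_patterns(lines)
  lines.filter_map do |line|
    source = line.strip
    next if source.empty? || source.start_with?('#')

    begin
      Regexp.new(source)
    rescue RegexpError
      nil  # ignore the broken pattern instead of failing the whole load
    end
  end
end

compiled = compile_patterns(['vimeo\.com/\d+', '# comment', 'youtube\.com/watch(', ''])
puts compiled.length  # 1 (only the Vimeo pattern is valid)
```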
239
+
240
+ ### Test Results
241
+
242
+ - ✅ **383 tests, 2,729 assertions, 0 failures, 0 errors**
243
+ - ✅ **92.88% line coverage** (926/997 lines)
244
+ - ✅ All video URL detection functionality fully tested
245
+
246
+ ## Performance Considerations
247
+
248
+ ### Caching
249
+
250
+ - Pattern files are cached locally when `cache_dir` is specified
251
+ - Remote patterns are only fetched once per session
252
+ - Client-side caching recommended for high-traffic applications
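The "fetched once per session" behaviour can be approximated with simple memoization. The `PatternStore` class below is a hypothetical illustration of that idea and does not reflect how the gem stores its patterns internally.

```ruby
# Hypothetical wrapper (not part of the gem): run an expensive loader
# at most once per instance and reuse the result afterwards.
class PatternStore
  def initialize(&loader)
    @loader = loader
    @loaded = false
  end

  def patterns
    unless @loaded
      @patterns = @loader.call  # e.g. a remote fetch followed by a compile step
      @loaded = true
    end
    @patterns
  end
end

fetch_count = 0
store = PatternStore.new do
  fetch_count += 1
  [/watch\?v=/]
end

3.times { store.patterns }
puts fetch_count  # 1
```

An explicit `@loaded` flag is used instead of `||=` so that a loader returning `nil` or an empty result is not retried on every call.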
253
+
254
+ ### Memory Usage
255
+
256
+ - Regex patterns are compiled once during client initialization
257
+ - Minimal memory overhead for video URL detection
258
+ - Efficient Set-based domain lookups
259
+
260
+ ### Network Optimization
261
+
262
+ - Remote pattern files are small (~50KB total)
263
+ - Graceful fallback to local files if remote fetch fails
264
+ - HTTP timeout configuration supported
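A fetch with timeouts and a fallback might look like the sketch below. It uses Ruby's standard `Net::HTTP`; the `read_patterns` helper, its `timeout:` and `fallback:` parameters, and the choice to return fallback text on any error are assumptions made for illustration, not the gem's API.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of the gem): read pattern text from an
# HTTP(S) URL, a file:// URL, or a plain path. Returns `fallback` instead
# of raising when the source cannot be read.
def read_patterns(source, timeout: 5, fallback: '')
  if source.match?(%r{\Ahttps?://})
    uri = URI.parse(source)
    options = { use_ssl: uri.scheme == 'https', open_timeout: timeout, read_timeout: timeout }
    Net::HTTP.start(uri.host, uri.port, **options) do |http|
      response = http.get(uri.request_uri)
      response.is_a?(Net::HTTPSuccess) ? response.body : fallback
    end
  else
    File.read(source.delete_prefix('file://'))
  end
rescue StandardError
  fallback  # network errors, timeouts, and missing files all land here
end

# A missing local file yields the fallback rather than an exception.
puts read_patterns('/nonexistent/patterns.txt', fallback: '# no patterns').inspect
```

In practice the fallback text could come from a previously cached copy of the pattern file.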
265
+
266
+ ## Migration Guide
267
+
268
+ ### From Previous Versions
269
+
270
+ There are no breaking changes; video URL detection is opt-in:
271
+
272
+ ```ruby
273
+ # Existing code continues to work unchanged
274
+ client = UrlCategorise::Client.new
275
+ categories = client.categorise('youtube.com')
276
+
277
+ # Enable video URL detection when needed
278
+ client_with_video = UrlCategorise::Client.new(regex_categorization: true)
279
+ is_video = client_with_video.video_url?('https://youtube.com/watch?v=abc')
280
+ ```
281
+
282
+ ### Enabling Video Detection
283
+
284
+ To use video URL detection in existing applications:
285
+
286
+ 1. **Enable regex categorization**: `regex_categorization: true`
287
+ 2. **Use the new method**: `client.video_url?(url)`
288
+ 3. **Optional configuration**: Custom pattern files, caching
289
+
290
+ ### Configuration Options
291
+
292
+ ```ruby
293
+ # Default configuration (recommended)
294
+ client = UrlCategorise::Client.new(regex_categorization: true)
295
+
296
+ # Custom pattern file (local)
297
+ client = UrlCategorise::Client.new(
298
+   regex_categorization: true,
298
+   regex_patterns_file: 'path/to/custom/patterns.txt'
300
+ )
301
+
302
+ # Custom pattern file (remote)
303
+ client = UrlCategorise::Client.new(
304
+   regex_categorization: true,
304
+   regex_patterns_file: 'https://example.com/patterns.txt'
306
+ )
307
+
308
+ # With caching
309
+ client = UrlCategorise::Client.new(
310
+   regex_categorization: true,
310
+   cache_dir: './cache'
312
+ )
313
+ ```
314
+
315
+ ## Maintenance
316
+
317
+ ### Updating Video Lists
318
+
319
+ The video hosting lists are automatically maintained in the GitHub repository. To generate updated lists locally:
320
+
321
+ ```bash
322
+ # Generate fresh video hosting lists
323
+ ruby bin/generate_video_lists
324
+
325
+ # This creates/updates:
326
+ # - lists/video_hosting_domains.hosts (3,500+ domains)
327
+ # - lists/video_url_patterns.txt (50+ patterns)
328
+ ```
329
+
330
+ ### Manual Pattern Curation
331
+
332
+ High-priority patterns are manually curated in the generation script:
333
+
334
+ - YouTube video and Shorts URLs
335
+ - Vimeo video URLs
336
+ - Dailymotion video URLs
337
+ - Twitch video URLs
338
+ - TikTok video URLs
339
+
340
+ ### List Health Monitoring
341
+
342
+ Use the existing health monitoring for video hosting lists:
343
+
344
+ ```ruby
345
+ client = UrlCategorise::Client.new
346
+ report = client.check_all_lists
347
+
348
+ # Check video hosting category specifically
349
+ video_status = report[:successful_lists][:video_hosting]
350
+ puts "Video hosting lists: #{video_status&.length || 0} URLs working"
351
+ ```
352
+
353
+ This feature adds video URL detection without changing existing behaviour, in keeping with the gem's focus on performance, reliability, and ease of use.