UrlCategorise 0.1.3 → 0.1.7

@@ -0,0 +1,215 @@
1
+ # UrlCategorise v0.1.4 Release Notes
2
+
3
+ ## New Features Added
4
+
5
+ ### 🆕 Dataset-Specific Helper Methods
6
+
7
+ Added dedicated helper methods for tracking dataset-only statistics separately from DNS blocklists:
8
+
9
+ ```ruby
10
+ client = UrlCategorise::Client.new(dataset_config: { kaggle: {} })
11
+ client.load_kaggle_dataset('owner', 'dataset-name')
12
+
13
+ # New dataset-specific methods
14
+ client.count_of_dataset_hosts # Returns hosts from datasets only
15
+ client.count_of_dataset_categories # Returns categories from datasets only
16
+
17
+ # Existing methods still work and count both
18
+ client.count_of_hosts # All hosts (DNS lists + datasets)
19
+ client.count_of_categories # All categories (DNS lists + datasets)
20
+ ```
21
+
22
+ **Implementation Details:**
23
+ - `count_of_dataset_hosts` - Sums hosts from categories tracked in `@dataset_categories` Set
24
+ - `count_of_dataset_categories` - Returns size of `@dataset_categories` Set
25
+ - Gracefully handles nil categories with safe navigation (`&.size || 0`)
26
+ - Both methods return 0 when no datasets are loaded
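The counting behaviour listed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the gem's actual source: `DatasetStats`, `hosts`, and `dataset_categories` are hypothetical names standing in for the client's internal state.

```ruby
require 'set'

# Hypothetical stand-in for the client's internal state; not the gem's source.
class DatasetStats
  def initialize(hosts:, dataset_categories:)
    @hosts = hosts # { category => [host, ...] or nil }
    @dataset_categories = Set.new(dataset_categories)
  end

  # Sum hosts only for categories that were loaded from datasets.
  # Safe navigation keeps nil host lists from raising.
  def count_of_dataset_hosts
    @dataset_categories.sum { |category| @hosts[category]&.size || 0 }
  end

  # Number of categories that originate from datasets.
  def count_of_dataset_categories
    @dataset_categories.size
  end
end

stats = DatasetStats.new(
  hosts: { malware: %w[a.example b.example], ads: %w[c.example], empty: nil },
  dataset_categories: %i[malware empty]
)
stats.count_of_dataset_hosts      # => 2
stats.count_of_dataset_categories # => 2
```

Keeping dataset categories in a separate Set is what lets the combined counters stay unchanged while the dataset-only counters read a subset of the same host table.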
27
+
28
+ ### 🆕 IAB Content Taxonomy Compliance
29
+
30
+ Full support for IAB (Interactive Advertising Bureau) Content Taxonomy standards:
31
+
32
+ ```ruby
33
+ # Enable IAB v3.0 compliance (recommended)
34
+ client = UrlCategorise::Client.new(
35
+ iab_compliance: true,
36
+ iab_version: :v3
37
+ )
38
+
39
+ # Enable IAB v2.0 compliance
40
+ client = UrlCategorise::Client.new(
41
+ iab_compliance: true,
42
+ iab_version: :v2
43
+ )
44
+
45
+ # Categorization returns IAB codes instead of custom categories
46
+ categories = client.categorise("badsite.com")
47
+ puts categories # => ["626"] (IAB v3 code for illegal content)
48
+
49
+ # Helper methods
50
+ client.iab_compliant? # => true/false
51
+ client.get_iab_mapping(:malware) # => "626" (v3) or "IAB25" (v2)
52
+ ```
53
+
54
+ **IAB Category Mappings:**
55
+
56
+ **IAB Content Taxonomy v3.0:**
57
+ - Security threats (`malware`, `phishing`, `illegal`) → `626` (Illegal Content)
58
+ - Advertising (`advertising`, `mobile_ads`) → `3` (Advertising)
59
+ - Gambling → `7-39` (Gambling subcategory)
60
+ - Adult content (`pornography`) → `626` (Adult Content)
61
+ - Social platforms → `14` (Society)
62
+ - Technology → `19` (Technology & Computing)
63
+
64
+ **IAB Content Taxonomy v2.0:**
65
+ - Security threats → `IAB25` (Non-Standard Content)
66
+ - Advertising → `IAB3` (Advertising)
67
+ - Gambling → `IAB7-39` (Gambling subcategory)
68
+ - Adult content → `IAB25-3` (Pornography)
69
+
70
+ **Implementation Details:**
71
+ - New `IabCompliance` module with comprehensive mappings
72
+ - Support for both v2.0 and v3.0 standards
73
+ - IAB compliance affects all categorization methods (`categorise`, `categorise_ip`, `resolve_and_categorise`)
74
+ - Automatic deduplication of IAB codes
75
+ - Graceful handling of unknown categories (returns 'Unknown')
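To make the lookup, deduplication, and 'Unknown' fallback concrete, here is a small sketch. It assumes a plain hash per taxonomy version and uses only codes quoted in these notes; `IabMappingSketch`, `code_for`, and `codes_for` are hypothetical names and may not match the real `IabCompliance` module.

```ruby
# Hypothetical mapping module; only codes quoted in the release notes appear here.
module IabMappingSketch
  MAPPINGS = {
    v3: { malware: '626', phishing: '626', advertising: '3', gambling: '7-39' },
    v2: { malware: 'IAB25', phishing: 'IAB25', advertising: 'IAB3', gambling: 'IAB7-39' }
  }.freeze

  # Look up one category; unknown categories or versions fall back to 'Unknown'.
  def self.code_for(category, version = :v3)
    MAPPINGS.fetch(version, {}).fetch(category.to_sym, 'Unknown')
  end

  # Map a list of categories and drop duplicate codes.
  def self.codes_for(categories, version = :v3)
    categories.map { |category| code_for(category, version) }.uniq
  end
end

IabMappingSketch.codes_for(%i[malware phishing advertising]) # => ["626", "3"]
IabMappingSketch.code_for(:not_a_category)                   # => "Unknown"
IabMappingSketch.code_for(:malware, :v2)                     # => "IAB25"
```

Because several internal categories share one IAB code, the `uniq` step is what keeps a host listed under both `malware` and `phishing` from reporting `626` twice.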
76
+
77
+ ### 🧪 Comprehensive Test Coverage
78
+
79
+ Added extensive test suites for new features:
80
+
81
+ **Dataset Methods Tests:**
82
+ - `client_dataset_methods_test.rb` (9 tests)
83
+ - Tests for empty, populated, and mixed dataset scenarios
84
+ - Integration tests with existing methods
85
+ - Edge case handling (nil categories, empty arrays)
86
+
87
+ **IAB Compliance Tests:**
88
+ - `iab_compliance_test.rb` (14 tests) - Module functionality
89
+ - `client_iab_compliance_test.rb` (19 tests) - Client integration
90
+ - Comprehensive mapping validation for v2 and v3
91
+ - Integration with all categorization methods
92
+ - DNS resolution with IAB compliance
93
+ - Error handling and edge cases
94
+
95
+ **Total Test Stats:**
96
+ - **316 tests, 2455 assertions, 0 failures, 0 errors**
97
+ - **94.69% line coverage** (660/697 lines)
98
+ - All new features fully tested with edge cases
99
+
100
+ ### 🔧 Technical Implementation
101
+
102
+ **New Files:**
103
+ - `lib/url_categorise/iab_compliance.rb` - IAB mapping module
104
+ - `test/url_categorise/client_dataset_methods_test.rb` - Dataset helper tests
105
+ - `test/url_categorise/iab_compliance_test.rb` - IAB module tests
106
+ - `test/url_categorise/client_iab_compliance_test.rb` - IAB client integration tests
107
+
108
+ **Updated Files:**
109
+ - `lib/url_categorise/client.rb` - Added dataset helpers and IAB support
110
+ - `lib/url_categorise.rb` - Required new IAB module
111
+ - `lib/url_categorise/version.rb` - Bumped to 0.1.4
112
+
113
+ **New Client Attributes:**
114
+ - `iab_compliance_enabled` - Boolean flag for IAB compliance
115
+ - `iab_version` - IAB taxonomy version (:v2 or :v3)
116
+
117
+ **New Client Methods:**
118
+ - `count_of_dataset_hosts` - Dataset-specific host count
119
+ - `count_of_dataset_categories` - Dataset-specific category count
120
+ - `iab_compliant?` - Check IAB compliance status
121
+ - `get_iab_mapping(category)` - Get IAB code for category
122
+
123
+ ### 🚀 Usage Examples
124
+
125
+ **Dataset Statistics:**
126
+ ```ruby
127
+ client = UrlCategorise::Client.new(dataset_config: { kaggle: {} })
128
+ client.load_kaggle_dataset('owner', 'dataset')
129
+
130
+ puts "Total hosts: #{client.count_of_hosts}" # All sources
131
+ puts "Dataset hosts: #{client.count_of_dataset_hosts}" # Datasets only
132
+ puts "DNS list hosts: #{client.count_of_hosts - client.count_of_dataset_hosts}"
133
+ ```
134
+
135
+ **IAB Compliance:**
136
+ ```ruby
137
+ # Production environment with IAB compliance
138
+ client = UrlCategorise::Client.new(
139
+ iab_compliance: true,
140
+ iab_version: :v3,
141
+ dataset_config: { kaggle: { username: 'user', api_key: 'key' } }
142
+ )
143
+
144
+ # All methods return IAB codes
145
+ domain_cats = client.categorise("example.com") # => ["3", "626"]
146
+ ip_cats = client.categorise_ip("192.168.1.100") # => ["626"]
147
+ resolved_cats = client.resolve_and_categorise("site.com") # => ["3"]
148
+
149
+ # Check compliance
150
+ puts "IAB compliant: #{client.iab_compliant?}" # => true
151
+ puts "Using version: #{client.iab_version}" # => :v3
152
+ ```
153
+
154
+ **Rails Service Integration:**
155
+ ```ruby
156
+ class UrlCategorizerService
157
+ def initialize
158
+ @client = UrlCategorise::ActiveRecordClient.new(
159
+ iab_compliance: Rails.env.production?,
160
+ iab_version: :v3,
161
+ dataset_config: {
162
+ kaggle: {
163
+ username: ENV['KAGGLE_USERNAME'],
164
+ api_key: ENV['KAGGLE_API_KEY']
165
+ }
166
+ }
167
+ )
168
+ end
169
+
170
+ def stats
171
+ {
172
+ total_hosts: @client.count_of_hosts,
173
+ dataset_hosts: @client.count_of_dataset_hosts,
174
+ dns_list_hosts: @client.count_of_hosts - @client.count_of_dataset_hosts,
175
+ iab_compliant: @client.iab_compliant?,
176
+ iab_version: @client.iab_version
177
+ }
178
+ end
179
+ end
180
+ ```
181
+
182
+ ### 🔄 Migration Guide
183
+
184
+ **From v0.1.3 to v0.1.4:**
185
+
186
+ 1. **No Breaking Changes** - All existing code continues to work
187
+ 2. **Optional New Features** - IAB compliance and dataset helpers are opt-in
188
+ 3. **Enhanced Statistics** - Use new helper methods for better insights
189
+
190
+ **Recommended Updates:**
191
+ ```ruby
192
+ # Before (still works)
193
+ client = UrlCategorise::Client.new
194
+ puts "Total: #{client.count_of_hosts}"
195
+
196
+ # After (enhanced)
197
+ client = UrlCategorise::Client.new(
198
+ iab_compliance: true, # Optional IAB compliance
199
+ iab_version: :v3 # Optional version selection
200
+ )
201
+ puts "Total hosts: #{client.count_of_hosts}"
202
+ puts "Dataset hosts: #{client.count_of_dataset_hosts}"
203
+ puts "IAB compliant: #{client.iab_compliant?}"
204
+ ```
205
+
206
+ ### 📊 Quality Assurance
207
+
208
+ - ✅ All 316 tests pass with 0 failures/errors
209
+ - ✅ 94.69% line coverage maintained
210
+ - ✅ Code style enforced with rubocop
211
+ - ✅ Comprehensive edge case testing
212
+ - ✅ Memory-efficient implementation
213
+ - ✅ Backward compatibility preserved
214
+
215
+ This release adds powerful new features while maintaining the reliability and performance standards established in previous versions.
@@ -0,0 +1,353 @@
1
+ # Video URL Detection Feature
2
+
3
+ ## Overview
4
+
5
+ The UrlCategorise gem now includes advanced video URL detection capabilities that allow you to determine if a URL is a direct link to video content, rather than just a homepage, profile page, or other non-video resource on a video hosting domain.
6
+
7
+ ## Key Features
8
+
9
+ ### 🎬 Direct Video URL Detection
10
+
11
+ The new `video_url?` method detects whether a URL points directly at video content:
12
+
13
+ ```ruby
14
+ client = UrlCategorise::Client.new(regex_categorization: true)
15
+
16
+ # Direct video content URLs return true
17
+ client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
18
+ client.video_url?("https://vimeo.com/123456789") # => true
19
+ client.video_url?("https://tiktok.com/@user/video/123") # => true
20
+ client.video_url?("https://dailymotion.com/video/x7abc123") # => true
21
+
22
+ # Non-video URLs return false
23
+ client.video_url?("https://youtube.com") # => false
24
+ client.video_url?("https://youtube.com/@channel") # => false
25
+ client.video_url?("https://google.com/search?q=cats") # => false
26
+ ```
27
+
28
+ ### 📡 Comprehensive Video Hosting Lists
29
+
30
+ - **3,500+ video hosting domains** extracted from yt-dlp (a youtube-dl fork) extractors
31
+ - **50+ regex patterns** for identifying video content URLs
32
+ - **Automatic generation** using `bin/generate_video_lists` script
33
+ - **Remote list fetching** from GitHub repository
34
+
35
+ ### 🔗 Remote Pattern Files
36
+
37
+ Video patterns are automatically fetched from the remote GitHub repository:
38
+
39
+ - **Video domains**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
40
+ - **URL patterns**: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
41
+
42
+ ### 🧠 Enhanced Categorization
43
+
44
+ Regex categorization provides more specific categorization for video URLs:
45
+
46
+ ```ruby
47
+ client = UrlCategorise::Client.new(regex_categorization: true)
48
+
49
+ # Basic domain categorization
50
+ client.categorise('https://youtube.com') # => [:video_hosting]
51
+
52
+ # Enhanced content detection for actual video URLs
53
+ client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
54
+ ```
55
+
56
+ ## Technical Implementation
57
+
58
+ ### Constants Updates
59
+
60
+ New constants in `UrlCategorise::Constants`:
61
+
62
+ ```ruby
63
+ # Remote video URL patterns file
64
+ VIDEO_URL_PATTERNS_FILE = 'https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt'.freeze
65
+
66
+ # Updated video hosting category to use remote list
67
+ video_hosting: ['https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts']
68
+ ```
69
+
70
+ ### Client Enhancements
71
+
72
+ New `Client` class features:
73
+
74
+ ```ruby
75
+ # Default regex patterns file now uses remote URL
76
+ attribute :regex_patterns_file, default: -> { VIDEO_URL_PATTERNS_FILE }
77
+
78
+ # New video URL detection method
79
+ def video_url?(url)
80
+ # 1. Check if URL is valid and not empty
81
+ # 2. Ensure regex categorization is enabled
82
+ # 3. Verify URL is from a video hosting domain
83
+ # 4. Check if URL matches video content patterns
84
+ # 5. Return true only if all conditions are met
85
+ end
86
+
87
+ # Enhanced pattern fetching supports remote URLs
88
+ def fetch_regex_patterns_content
89
+ # Supports HTTP/HTTPS, file://, and direct file paths
90
+ # Graceful error handling for network failures
91
+ end
92
+ ```
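The five commented steps in `video_url?` could be realised along these lines. This is a self-contained sketch under stated assumptions, not the gem's code: `VideoUrlSketch` is a hypothetical class, and the domain list and patterns are passed in directly rather than fetched from the remote files.

```ruby
require 'set'
require 'uri'

# Hypothetical, self-contained sketch of the five checks; not the gem's source.
class VideoUrlSketch
  def initialize(video_domains:, patterns:, regex_categorization: true)
    @video_domains = Set.new(video_domains)
    @patterns = patterns
    @regex_categorization = regex_categorization
  end

  def video_url?(url)
    return false if url.nil? || url.to_s.strip.empty? # 1. URL is present
    return false unless @regex_categorization         # 2. feature is enabled

    host = URI.parse(url.to_s).host
    return false if host.nil?

    host = host.downcase.sub(/\Awww\./, '')
    return false unless @video_domains.include?(host) # 3. video hosting domain

    @patterns.any? { |pattern| pattern.match?(url) }  # 4 and 5. content pattern
  rescue URI::InvalidURIError
    false # unparseable input is treated as "not a video URL"
  end
end

detector = VideoUrlSketch.new(
  video_domains: %w[youtube.com vimeo.com],
  patterns: [%r{youtube\.com/watch\?v=[\w-]+}, %r{vimeo\.com/\d+}]
)
detector.video_url?('https://youtube.com/watch?v=dQw4w9WgXcQ') # => true
detector.video_url?('https://youtube.com/@channel')            # => false
detector.video_url?('https://google.com/search?q=cats')        # => false
```

Checking the domain before running any regex keeps the common case (a URL on a non-video host) down to a single Set lookup.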
93
+
94
+ ### Pattern Generation
95
+
96
+ The `bin/generate_video_lists` script has been enhanced:
97
+
98
+ - **Fixed terminal output** - Clean progress display without character overlap
99
+ - **Suppressed regex warnings** - No more nested operator warnings
100
+ - **Improved domain extraction** - More comprehensive pattern matching
101
+ - **Manual high-priority patterns** - Curated YouTube, Vimeo, TikTok patterns
102
+ - **3,500+ domains extracted** - Up from 96 domains previously
103
+
104
+ ## Usage Examples
105
+
106
+ ### Basic Video URL Detection
107
+
108
+ ```ruby
109
+ require 'url_categorise'
110
+
111
+ # Initialize client with regex categorization
112
+ client = UrlCategorise::Client.new(regex_categorization: true)
113
+
114
+ # Test various video URLs
115
+ urls = [
116
+ 'https://youtube.com/watch?v=dQw4w9WgXcQ', # YouTube video
117
+ 'https://youtube.com', # YouTube homepage
118
+ 'https://youtube.com/@pewdiepie', # YouTube channel
119
+ 'https://vimeo.com/123456789', # Vimeo video
120
+ 'https://tiktok.com/@user/video/123', # TikTok video
121
+ 'https://google.com/search?q=cats' # Non-video site
122
+ ]
123
+
124
+ urls.each do |url|
125
+ is_video = client.video_url?(url)
126
+ categories = client.categorise(url)
127
+ puts "#{url} -> video: #{is_video}, categories: #{categories}"
128
+ end
129
+ ```
130
+
131
+ ### Content Filtering Application
132
+
133
+ ```ruby
134
+ class ContentFilter
135
+ def initialize
136
+ @client = UrlCategorise::Client.new(regex_categorization: true)
137
+ end
138
+
139
+ def filter_video_content(urls)
140
+ results = {
141
+ video_content: [],
142
+ video_sites: [],
143
+ other: []
144
+ }
145
+
146
+ urls.each do |url|
147
+ if @client.video_url?(url)
148
+ results[:video_content] << url
149
+ elsif @client.categorise(url).include?(:video_hosting)
150
+ results[:video_sites] << url
151
+ else
152
+ results[:other] << url
153
+ end
154
+ end
155
+
156
+ results
157
+ end
158
+ end
159
+
160
+ # Usage
161
+ filter = ContentFilter.new
162
+ results = filter.filter_video_content([
163
+ 'https://youtube.com/watch?v=abc', # -> :video_content
164
+ 'https://youtube.com/@channel', # -> :video_sites
165
+ 'https://example.com' # -> :other
166
+ ])
167
+ ```
168
+
169
+ ### Rails Integration
170
+
171
+ ```ruby
172
+ # app/services/video_url_service.rb
173
+ class VideoUrlService
174
+ include Singleton
175
+
176
+ def initialize
177
+ @client = UrlCategorise::Client.new(
178
+ regex_categorization: true,
179
+ cache_dir: Rails.root.join('tmp', 'url_cache')
180
+ )
181
+ end
182
+
183
+ def video_content?(url)
184
+ Rails.cache.fetch("video_content_#{Digest::MD5.hexdigest(url)}", expires_in: 1.hour) do
185
+ @client.video_url?(url)
186
+ end
187
+ end
188
+
189
+ def categorize_video_url(url)
190
+ categories = @client.categorise(url)
191
+ is_video_content = @client.video_url?(url)
192
+
193
+ {
194
+ url: url,
195
+ categories: categories,
196
+ is_video_content: is_video_content,
197
+ risk_level: calculate_risk_level(categories)
198
+ }
199
+ end
200
+
201
+ private
202
+
203
+ def calculate_risk_level(categories)
204
+ return 'safe' if categories.empty?
205
+ return 'blocked' if (categories & [:malware, :phishing]).any?
206
+ return 'restricted' if (categories & [:pornography, :gambling]).any?
207
+ 'allowed'
208
+ end
209
+ end
210
+
211
+ # Usage in controllers
212
+ class VideosController < ApplicationController
213
+ def check
214
+ url = params[:url]
215
+ result = VideoUrlService.instance.categorize_video_url(url)
216
+ render json: result
217
+ end
218
+ end
219
+ ```
220
+
221
+ ## Testing
222
+
223
+ Comprehensive test suite with 15 new tests covering:
224
+
225
+ ### Core Functionality Tests
226
+
227
+ - Video URL detection when regex categorization is disabled/enabled
228
+ - Detection of video vs non-video domains
229
+ - Video content URLs vs homepage/profile URLs
230
+ - Multiple video hosting categories (`video`, `video_hosting`, etc.)
231
+ - Graceful handling of invalid URLs
232
+
233
+ ### Network and Error Handling Tests
234
+
235
+ - Remote pattern file fetching with mocked HTTP requests
236
+ - Graceful failure when remote files are not accessible
237
+ - Timeout and network error handling
238
+ - Invalid regex pattern handling
239
+
240
+ ### Test Results
241
+
242
+ - ✅ **383 tests, 2,729 assertions, 0 failures, 0 errors**
243
+ - ✅ **92.88% line coverage** (926/997 lines)
244
+ - ✅ All video URL detection functionality fully tested
245
+
246
+ ## Performance Considerations
247
+
248
+ ### Caching
249
+
250
+ - Pattern files are cached locally when `cache_dir` is specified
251
+ - Remote patterns are only fetched once per session
252
+ - Client-side caching recommended for high-traffic applications
253
+
254
+ ### Memory Usage
255
+
256
+ - Regex patterns are compiled once during client initialization
257
+ - Minimal memory overhead for video URL detection
258
+ - Efficient Set-based domain lookups
259
+
260
+ ### Network Optimization
261
+
262
+ - Remote pattern files are small (~50KB total)
263
+ - Graceful fallback to local files if remote fetch fails
264
+ - HTTP timeout configuration supported
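As an illustration of the source handling and failure behaviour described for pattern fetching, the sketch below reads a pattern file from an HTTP(S) URL, a `file://` URL, or a plain path, and returns an empty string on any failure. `read_patterns_source` and its timeout keywords are hypothetical names chosen for this example; the gem's `fetch_regex_patterns_content` may be structured differently.

```ruby
require 'net/http'
require 'tempfile'
require 'uri'

# Hypothetical helper: returns the pattern file's text, or '' on any failure.
def read_patterns_source(location, open_timeout: 5, read_timeout: 10)
  case location.to_s
  when %r{\Ahttps?://}
    uri = URI.parse(location.to_s)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                    open_timeout: open_timeout, read_timeout: read_timeout) do |http|
      response = http.get(uri.request_uri)
      response.is_a?(Net::HTTPSuccess) ? response.body : ''
    end
  when %r{\Afile://}
    File.read(location.to_s.sub(%r{\Afile://}, ''))
  else
    File.read(location.to_s)
  end
rescue StandardError
  '' # network errors, timeouts, and missing files all degrade to "no patterns"
end

# Usage with a local file, so the example runs without network access.
file = Tempfile.new('patterns')
file.write("vimeo\\.com/\\d+\n")
file.flush

from_path = read_patterns_source(file.path)
from_file_url = read_patterns_source("file://#{file.path}")
missing = read_patterns_source('/no/such/patterns.txt') # => ""
```

Returning an empty string rather than raising means a caller can treat "no patterns available" the same way as "regex categorization found nothing", which matches the graceful-failure behaviour the notes describe.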
265
+
266
+ ## Migration Guide
267
+
268
+ ### From Previous Versions
269
+
270
+ No breaking changes - video URL detection is entirely optional:
271
+
272
+ ```ruby
273
+ # Existing code continues to work unchanged
274
+ client = UrlCategorise::Client.new
275
+ categories = client.categorise('youtube.com')
276
+
277
+ # Enable video URL detection when needed
278
+ client_with_video = UrlCategorise::Client.new(regex_categorization: true)
279
+ is_video = client_with_video.video_url?('https://youtube.com/watch?v=abc')
280
+ ```
281
+
282
+ ### Enabling Video Detection
283
+
284
+ To use video URL detection in existing applications:
285
+
286
+ 1. **Enable regex categorization**: `regex_categorization: true`
287
+ 2. **Use the new method**: `client.video_url?(url)`
288
+ 3. **Optional configuration**: Custom pattern files, caching
289
+
290
+ ### Configuration Options
291
+
292
+ ```ruby
293
+ # Default configuration (recommended)
294
+ client = UrlCategorise::Client.new(regex_categorization: true)
295
+
296
+ # Custom pattern file (local)
297
+ client = UrlCategorise::Client.new(
298
+ regex_categorization: true,
299
+ regex_patterns_file: 'path/to/custom/patterns.txt'
300
+ )
301
+
302
+ # Custom pattern file (remote)
303
+ client = UrlCategorise::Client.new(
304
+ regex_categorization: true,
305
+ regex_patterns_file: 'https://example.com/patterns.txt'
306
+ )
307
+
308
+ # With caching
309
+ client = UrlCategorise::Client.new(
310
+ regex_categorization: true,
311
+ cache_dir: './cache'
312
+ )
313
+ ```
314
+
315
+ ## Maintenance
316
+
317
+ ### Updating Video Lists
318
+
319
+ The video hosting lists are automatically maintained in the GitHub repository. To generate updated lists locally:
320
+
321
+ ```bash
322
+ # Generate fresh video hosting lists
323
+ ruby bin/generate_video_lists
324
+
325
+ # This creates/updates:
326
+ # - lists/video_hosting_domains.hosts (3,500+ domains)
327
+ # - lists/video_url_patterns.txt (50+ patterns)
328
+ ```
329
+
330
+ ### Manual Pattern Curation
331
+
332
+ High-priority patterns are manually curated in the generation script:
333
+
334
+ - YouTube video and Shorts URLs
335
+ - Vimeo video URLs
336
+ - Dailymotion video URLs
337
+ - Twitch video URLs
338
+ - TikTok video URLs
339
+
340
+ ### List Health Monitoring
341
+
342
+ Use the existing health monitoring for video hosting lists:
343
+
344
+ ```ruby
345
+ client = UrlCategorise::Client.new
346
+ report = client.check_all_lists
347
+
348
+ # Check video hosting category specifically
349
+ video_status = report[:successful_lists][:video_hosting]
350
+ puts "Video hosting lists: #{video_status&.length || 0} URLs working"
351
+ ```
352
+
353
+ This feature provides enterprise-grade video URL detection while maintaining the gem's focus on performance, reliability, and ease of use.
@@ -118,7 +118,7 @@ module UrlCategorise
118
118
  return unless UrlCategorise::Models.available?
119
119
 
120
120
  # Store list metadata
121
- @host_urls.each do |category, urls|
121
+ (host_urls || {}).each do |category, urls|
122
122
  urls.each do |url|
123
123
  next unless url.is_a?(String)
124
124