UrlCategorise 0.1.3 → 0.1.7
- checksums.yaml +4 -4
- data/.claude/settings.local.json +8 -2
- data/.gitignore +2 -0
- data/CLAUDE.md +140 -2
- data/Gemfile.lock +17 -1
- data/README.md +450 -7
- data/bin/export_csv +120 -0
- data/bin/export_hosts +68 -0
- data/bin/generate_video_lists +373 -0
- data/bin/rake +2 -0
- data/correct_usage_example.rb +64 -0
- data/docs/v0.1.4-features.md +215 -0
- data/docs/video-url-detection.md +353 -0
- data/lib/url_categorise/active_record_client.rb +1 -1
- data/lib/url_categorise/client.rb +699 -39
- data/lib/url_categorise/constants.rb +9 -6
- data/lib/url_categorise/dataset_processor.rb +27 -10
- data/lib/url_categorise/iab_compliance.rb +149 -0
- data/lib/url_categorise/version.rb +1 -1
- data/lib/url_categorise.rb +2 -0
- data/lists/video_hosting_domains.hosts +7057 -0
- data/lists/video_url_patterns.txt +297 -0
- data/url_categorise.gemspec +5 -2
- metadata +70 -3
data/README.md
CHANGED
@@ -5,6 +5,10 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
|
|
5
5
|
## Features
|
6
6
|
|
7
7
|
- **Comprehensive Coverage**: 60+ high-quality categories including security, content, and specialized lists
|
8
|
+
- **Video Content Detection**: Advanced regex-based categorization with `video_url?` method to distinguish video content from other website resources
|
9
|
+
- **Custom Video Lists**: Generate and maintain comprehensive video hosting domain lists using yt-dlp extractors
|
10
|
+
- **Kaggle Dataset Integration**: Automatic loading and processing of machine learning datasets from Kaggle
|
11
|
+
- **Multiple Data Sources**: Supports blocklists, CSV datasets, and Kaggle ML datasets
|
8
12
|
- **Multiple List Formats**: Supports hosts files, pfSense, AdSense, uBlock Origin, dnsmasq, and plain text formats
|
9
13
|
- **Intelligent Caching**: Hash-based file update detection with configurable local cache
|
10
14
|
- **DNS Resolution**: Resolve domains to IPs and check against IP-based blocklists
|
@@ -14,6 +18,10 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
|
|
14
18
|
- **Metadata Tracking**: Track last update times, ETags, and content hashes
|
15
19
|
- **Health Monitoring**: Automatic detection and removal of broken blocklist sources
|
16
20
|
- **List Validation**: Built-in tools to verify all configured URLs are accessible
|
21
|
+
- **Auto-Loading Datasets**: Automatic processing of predefined datasets during client initialization
|
22
|
+
- **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
|
23
|
+
- **Data Export**: Export categorized data as hosts files per category or comprehensive CSV exports
|
24
|
+
- **CLI Commands**: Command-line utilities for data export and list checking
|
17
25
|
|
18
26
|
## Installation
|
19
27
|
|
@@ -44,6 +52,15 @@ puts "Total hosts: #{client.count_of_hosts}"
|
|
44
52
|
puts "Categories: #{client.count_of_categories}"
|
45
53
|
puts "Data size: #{client.size_of_data} MB"
|
46
54
|
|
55
|
+
# Get detailed size breakdown
|
56
|
+
puts "Total data size: #{client.size_of_data} MB (#{client.size_of_data_bytes} bytes)"
|
57
|
+
puts "Blocklist data size: #{client.size_of_blocklist_data} MB (#{client.size_of_blocklist_data_bytes} bytes)"
|
58
|
+
puts "Dataset data size: #{client.size_of_dataset_data} MB (#{client.size_of_dataset_data_bytes} bytes)"
|
59
|
+
|
60
|
+
# Get dataset-specific statistics (if datasets are loaded)
|
61
|
+
puts "Dataset hosts: #{client.count_of_dataset_hosts}"
|
62
|
+
puts "Dataset categories: #{client.count_of_dataset_categories}"
|
63
|
+
|
47
64
|
# Categorize a URL or domain
|
48
65
|
categories = client.categorise("badsite.com")
|
49
66
|
puts "Categories: #{categories}" # => [:malware, :phishing]
|
@@ -57,6 +74,116 @@ ip_categories = client.categorise_ip("192.168.1.100")
|
|
57
74
|
puts "IP categories: #{ip_categories}"
|
58
75
|
```
|
59
76
|
|
77
|
+
## New Features
|
78
|
+
|
79
|
+
### Dynamic Settings with ActiveAttr
|
80
|
+
|
81
|
+
The Client class now supports in-memory modification of settings using ActiveAttr:
|
82
|
+
|
83
|
+
```ruby
|
84
|
+
client = UrlCategorise::Client.new
|
85
|
+
|
86
|
+
# Modify settings dynamically
|
87
|
+
client.smart_categorization_enabled = true
|
88
|
+
client.iab_compliance_enabled = true
|
89
|
+
client.iab_version = :v2
|
90
|
+
client.request_timeout = 30
|
91
|
+
client.dns_servers = ['8.8.8.8', '8.8.4.4']
|
92
|
+
|
93
|
+
# Settings take effect immediately - no need to recreate the client
|
94
|
+
categories = client.categorise('reddit.com') # Uses new smart categorization rules
|
95
|
+
```
|
96
|
+
|
97
|
+
### Data Export Features
|
98
|
+
|
99
|
+
#### Hosts File Export
|
100
|
+
|
101
|
+
Export all categorized domains as separate hosts files per category:
|
102
|
+
|
103
|
+
```ruby
|
104
|
+
# Export to default location
|
105
|
+
result = client.export_hosts_files
|
106
|
+
|
107
|
+
# Export to custom location
|
108
|
+
result = client.export_hosts_files('/custom/export/path')
|
109
|
+
|
110
|
+
# Result includes file information and summary
|
111
|
+
puts "Exported #{result[:_summary][:total_categories]} categories"
|
112
|
+
puts "Total domains: #{result[:_summary][:total_domains]}"
|
113
|
+
puts "Files saved to: #{result[:_summary][:export_directory]}"
|
114
|
+
```
|
115
|
+
|
116
|
+
Each category gets its own hosts file (e.g., `malware.hosts`, `advertising.hosts`) with proper headers and sorted domains.
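As a rough illustration of what per-category hosts output could look like, the sketch below writes one sorted `.hosts` file per category. The helper name `write_hosts_files`, the `0.0.0.0` sink address, and the header text are assumptions made for this example; the gem's own `export_hosts_files` may format its files differently.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of the gem): writes one hosts file per category.
# `hosts_by_category` maps a category symbol to an array of domain strings.
def write_hosts_files(hosts_by_category, export_dir)
  FileUtils.mkdir_p(export_dir)

  hosts_by_category.each_with_object({}) do |(category, domains), written|
    sorted = domains.uniq.sort
    path = File.join(export_dir, "#{category}.hosts")

    # Assumed header layout: a couple of comment lines, then one entry per domain.
    lines = ["# Category: #{category}", "# Domains: #{sorted.size}"]
    lines.concat(sorted.map { |domain| "0.0.0.0 #{domain}" })

    File.write(path, lines.join("\n") + "\n")
    written[category] = { path: path, count: sorted.size }
  end
end
```

Calling `write_hosts_files({ malware: ['b.example', 'a.example'] }, Dir.mktmpdir)` would produce a single `malware.hosts` file with the two domains in alphabetical order.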
|
117
|
+
|
118
|
+
#### CSV Data Export
|
119
|
+
|
120
|
+
Export all data as a single comprehensive CSV file for AI training and analysis:
|
121
|
+
|
122
|
+
```ruby
|
123
|
+
# Export to default location
|
124
|
+
result = client.export_csv_data
|
125
|
+
|
126
|
+
# Export to custom location with IAB compliance
|
127
|
+
client.iab_compliance_enabled = true
|
128
|
+
result = client.export_csv_data('/custom/export/path')
|
129
|
+
|
130
|
+
# Returns information about created files:
|
131
|
+
# {
|
132
|
+
# csv_file: '/path/url_categorise_comprehensive_export_20231201_143022.csv',
|
133
|
+
# summary_file: '/path/export_summary_20231201_143022.json',
|
134
|
+
# total_entries: 50000,
|
135
|
+
# summary: { ... },
|
136
|
+
# export_directory: '/path'
|
137
|
+
# }
|
138
|
+
```
|
139
|
+
|
140
|
+
**Single comprehensive CSV file contains:**
|
141
|
+
|
142
|
+
- **Domain Categorization Data**: All processed domains with categories, source types, IAB mappings
|
143
|
+
- **Raw Dataset Content**: Original dataset entries with titles, descriptions, text, summaries, and all available fields
|
144
|
+
- **Dynamic Headers**: Automatically adapts to include all available data fields
|
145
|
+
- **Data Type Column**: Distinguishes between 'domain_categorization', 'raw_dataset_content', etc.
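The "dynamic headers" idea can be sketched with Ruby's standard `csv` library: collect the union of keys across all rows, then emit every row against that header set. The helper name `build_csv` and the sample field names are assumptions for illustration only and do not reflect the gem's internal column layout.

```ruby
require 'csv'

# Hypothetical helper (not part of the gem): builds one CSV string from row
# hashes that do not all share the same keys.
def build_csv(rows)
  # The header row is the union of all keys, in first-seen order.
  headers = rows.flat_map(&:keys).uniq

  CSV.generate do |csv|
    csv << headers.map(&:to_s)
    # Fields a row lacks are written as empty cells.
    rows.each { |row| csv << headers.map { |header| row[header] } }
  end
end
```

With one `domain_categorization` row and one `raw_dataset_content` row, the output has a single header line covering the fields of both, and each row leaves the columns it lacks empty.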
|
146
|
+
|
147
|
+
**Key Features:**
|
148
|
+
- Everything in one file for easy analysis and AI/ML training
|
149
|
+
- Rich textual content from original datasets
|
150
|
+
- IAB Content Taxonomy compliance mapping
|
151
|
+
- Smart categorization metadata
|
152
|
+
- Source type tracking (dataset vs blocklist)
|
153
|
+
|
154
|
+
#### CLI Commands
|
155
|
+
|
156
|
+
Command-line utilities for data export:
|
157
|
+
|
158
|
+
```bash
|
159
|
+
# Export hosts files
|
160
|
+
$ bundle exec export_hosts --output /tmp/hosts --verbose
|
161
|
+
|
162
|
+
# Export CSV data with all features enabled
|
163
|
+
$ bundle exec export_csv --output /tmp/csv --iab-compliance --smart-categorization --auto-load-datasets --verbose
|
164
|
+
|
165
|
+
# Generate updated video hosting lists
|
166
|
+
$ ruby bin/generate_video_lists
|
167
|
+
|
168
|
+
# Check health of all blocklist URLs
|
169
|
+
$ bundle exec check_lists
|
170
|
+
|
171
|
+
# Export with custom Kaggle credentials
|
172
|
+
$ bundle exec export_csv --auto-load-datasets --kaggle-credentials ~/my-kaggle.json --verbose
|
173
|
+
|
174
|
+
# Basic export (domains only)
|
175
|
+
$ bundle exec export_csv --output /tmp/csv
|
176
|
+
|
177
|
+
# Check URL health (existing command)
|
178
|
+
$ bundle exec check_lists
|
179
|
+
```
|
180
|
+
|
181
|
+
**Key CLI Options:**
|
182
|
+
- `--auto-load-datasets`: Load datasets from constants to include rich text content
|
183
|
+
- `--kaggle-credentials FILE`: Specify custom Kaggle credentials file
|
184
|
+
- `--iab-compliance`: Enable IAB Content Taxonomy mapping
|
185
|
+
- `--smart-categorization`: Enable intelligent category filtering
|
186
|
+
|
60
187
|
## Advanced Configuration
|
61
188
|
|
62
189
|
### File Caching
|
@@ -113,7 +240,11 @@ client = UrlCategorise::Client.new(
|
|
113
240
|
cache_dir: "./url_cache", # Enable local caching
|
114
241
|
force_download: false, # Use cache when available
|
115
242
|
dns_servers: ['1.1.1.1', '1.0.0.1'], # Cloudflare DNS servers
|
116
|
-
request_timeout: 15
|
243
|
+
request_timeout: 15, # 15 second HTTP timeout
|
244
|
+
iab_compliance: true, # Enable IAB compliance
|
245
|
+
iab_version: :v3, # Use IAB Content Taxonomy v3.0
|
246
|
+
auto_load_datasets: false, # Disable automatic dataset loading (default)
|
247
|
+
smart_categorization: false # Disable smart post-processing (default)
|
117
248
|
)
|
118
249
|
```
|
119
250
|
|
@@ -132,6 +263,248 @@ host_urls = {
|
|
132
263
|
client = UrlCategorise::Client.new(host_urls: host_urls)
|
133
264
|
```
|
134
265
|
|
266
|
+
### Video Content Detection
|
267
|
+
|
268
|
+
The gem includes advanced regex-based categorization specifically for video hosting platforms. This helps distinguish between actual video content URLs and other resources like homepages, user profiles, playlists, or community content.
|
269
|
+
|
270
|
+
#### Video Hosting Domains
|
271
|
+
|
272
|
+
The gem maintains a comprehensive list of video hosting domains extracted from the extractors in yt-dlp (a youtube-dl fork):
|
273
|
+
|
274
|
+
```ruby
|
275
|
+
# Generate/update video hosting lists
|
276
|
+
system("ruby bin/generate_video_lists")
|
277
|
+
|
278
|
+
# Use video hosting categorization
|
279
|
+
client = UrlCategorise::Client.new
|
280
|
+
categories = client.categorise("youtube.com")
|
281
|
+
# => [:video_hosting]
|
282
|
+
```
|
283
|
+
|
284
|
+
#### Video Content vs Other Resources
|
285
|
+
|
286
|
+
Enable regex categorization to distinguish video content from other resources:
|
287
|
+
|
288
|
+
```ruby
|
289
|
+
client = UrlCategorise::Client.new(
|
290
|
+
regex_categorization: true # Uses remote video patterns by default
|
291
|
+
)
|
292
|
+
|
293
|
+
# Regular homepage gets basic category
|
294
|
+
client.categorise("https://youtube.com")
|
295
|
+
# => [:video_hosting]
|
296
|
+
|
297
|
+
# Actual video URL gets enhanced categorization
|
298
|
+
client.categorise("https://youtube.com/watch?v=dQw4w9WgXcQ")
|
299
|
+
# => [:video_hosting, :video_hosting_content]
|
300
|
+
|
301
|
+
# User profile page - no content enhancement
|
302
|
+
client.categorise("https://youtube.com/@username")
|
303
|
+
# => [:video_hosting]
|
304
|
+
```
|
305
|
+
|
306
|
+
#### Direct Video URL Detection
|
307
|
+
|
308
|
+
Use the `video_url?` method to check if a URL is a direct link to video content:
|
309
|
+
|
310
|
+
```ruby
|
311
|
+
client = UrlCategorise::Client.new(regex_categorization: true)
|
312
|
+
|
313
|
+
# Check if URLs are direct video content links
|
314
|
+
client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
|
315
|
+
client.video_url?("https://youtube.com") # => false
|
316
|
+
client.video_url?("https://youtube.com/@channel") # => false
|
317
|
+
client.video_url?("https://vimeo.com/123456789") # => true
|
318
|
+
client.video_url?("https://tiktok.com/@user/video/123") # => true
|
319
|
+
|
320
|
+
# Works with various video hosting platforms
|
321
|
+
client.video_url?("https://dailymotion.com/video/x7abc123") # => true
|
322
|
+
client.video_url?("https://twitch.tv/videos/1234567890") # => true
|
323
|
+
|
324
|
+
# Returns false for non-video domains
|
325
|
+
client.video_url?("https://google.com/search?q=cats") # => false
|
326
|
+
```
|
327
|
+
|
328
|
+
**How it works:**
|
329
|
+
1. First checks if the URL is from a known video hosting domain
|
330
|
+
2. Then uses regex patterns to determine if it's a direct video content URL
|
331
|
+
3. Returns `true` only if both conditions are met
|
332
|
+
4. Handles invalid URLs gracefully (returns `false`)
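The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the gem's implementation: the method name `video_url_sketch?`, the two-entry domain set, and the two regex patterns are all hypothetical stand-ins for the lists the gem loads from `lists/video_hosting_domains.hosts` and `lists/video_url_patterns.txt`.

```ruby
require 'uri'
require 'set'

# Illustrative data only; the real lists are far larger.
VIDEO_DOMAINS = Set['youtube.com', 'vimeo.com'].freeze
VIDEO_PATTERNS = [
  %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=[\w-]+},
  %r{\Ahttps?://(?:www\.)?vimeo\.com/\d+}
].freeze

# Hypothetical two-step check: known video domain first, then a content pattern.
def video_url_sketch?(url)
  host = URI.parse(url).host
  return false if host.nil?

  host = host.downcase.sub(/\Awww\./, '')
  return false unless VIDEO_DOMAINS.include?(host)

  VIDEO_PATTERNS.any? { |pattern| pattern.match?(url) }
rescue URI::InvalidURIError
  # Unparseable input is treated as "not a video URL".
  false
end
```

Checking the domain before running any regex keeps the common case (a non-video domain) cheap, since a set lookup is much faster than scanning every pattern.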
|
333
|
+
|
334
|
+
#### Maintaining Video Lists
|
335
|
+
|
336
|
+
The gem includes a script to generate and maintain comprehensive video hosting lists:
|
337
|
+
|
338
|
+
```bash
|
339
|
+
# Generate updated video hosting lists
|
340
|
+
ruby bin/generate_video_lists
|
341
|
+
|
342
|
+
# This creates:
|
343
|
+
# - lists/video_hosting_domains.hosts (PiHole compatible)
|
344
|
+
# - lists/video_url_patterns.txt (Regex patterns for content detection)
|
345
|
+
```
|
346
|
+
|
347
|
+
The script fetches data from yt-dlp extractors and combines it with manually curated major platforms to ensure comprehensive coverage.
|
348
|
+
|
349
|
+
### Smart Categorization (Post-Processing)
|
350
|
+
|
351
|
+
Smart categorization solves the problem of overly broad domain-level categorization. For example, `reddit.com` might appear in health & fitness blocklists, but not all Reddit content is health-related.
|
352
|
+
|
353
|
+
#### The Problem
|
354
|
+
|
355
|
+
```ruby
|
356
|
+
# Without smart categorization
|
357
|
+
client.categorise("reddit.com")
|
358
|
+
# => [:reddit, :social_media, :health_and_fitness, :forums] # Too broad!
|
359
|
+
|
360
|
+
client.categorise("reddit.com/r/technology")
|
361
|
+
# => [:reddit, :social_media, :health_and_fitness, :forums] # Still wrong!
|
362
|
+
```
|
363
|
+
|
364
|
+
#### The Solution
|
365
|
+
|
366
|
+
```ruby
|
367
|
+
# Enable smart categorization
|
368
|
+
client = UrlCategorise::Client.new(
|
369
|
+
smart_categorization: true # Remove overly broad categories
|
370
|
+
)
|
371
|
+
|
372
|
+
client.categorise("reddit.com")
|
373
|
+
# => [:reddit, :social_media] # Much more accurate!
|
374
|
+
```
|
375
|
+
|
376
|
+
#### How It Works
|
377
|
+
|
378
|
+
Smart categorization automatically removes overly broad categories for known platforms:
|
379
|
+
|
380
|
+
- **Social Media Platforms** (Reddit, Facebook, Twitter, etc.): Removes categories like `:health_and_fitness`, `:forums`, `:news`, `:technology`, `:education`
|
381
|
+
- **Search Engines** (Google, Bing, etc.): Removes categories like `:news`, `:shopping`, `:travel`
|
382
|
+
- **Video Platforms** (YouTube, Vimeo, etc.): Removes categories like `:education`, `:entertainment`, `:music`
|
383
|
+
|
384
|
+
#### Custom Smart Rules
|
385
|
+
|
386
|
+
You can define custom rules for specific domains or URL patterns:
|
387
|
+
|
388
|
+
```ruby
|
389
|
+
custom_rules = {
|
390
|
+
reddit_subreddits: {
|
391
|
+
domains: ['reddit.com'],
|
392
|
+
remove_categories: [:health_and_fitness, :forums],
|
393
|
+
add_categories_by_path: {
|
394
|
+
/\/r\/fitness/ => [:health_and_fitness], # Add back for /r/fitness
|
395
|
+
/\/r\/technology/ => [:technology], # Add technology for /r/technology
|
396
|
+
/\/r\/programming/ => [:technology, :programming]
|
397
|
+
}
|
398
|
+
},
|
399
|
+
my_company_domains: {
|
400
|
+
domains: ['mycompany.com'],
|
401
|
+
allowed_categories_only: [:business, :technology] # Only allow specific categories
|
402
|
+
}
|
403
|
+
}
|
404
|
+
|
405
|
+
client = UrlCategorise::Client.new(
|
406
|
+
smart_categorization: true,
|
407
|
+
smart_rules: custom_rules
|
408
|
+
)
|
409
|
+
|
410
|
+
# Now path-based categorization works
|
411
|
+
client.categorise('reddit.com') # => [:reddit, :social_media]
|
412
|
+
client.categorise('reddit.com/r/fitness') # => [:reddit, :social_media, :health_and_fitness]
|
413
|
+
client.categorise('reddit.com/r/technology') # => [:reddit, :social_media, :technology]
|
414
|
+
```
|
415
|
+
|
416
|
+
#### Available Rule Types
|
417
|
+
|
418
|
+
- **`remove_categories`**: Remove specific categories for domains
|
419
|
+
- **`keep_primary_only`**: Keep only specified categories, remove others
|
420
|
+
- **`allowed_categories_only`**: Only allow specific categories, block all others
|
421
|
+
- **`add_categories_by_path`**: Add categories based on URL path patterns
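To make the rule types concrete, here is a minimal sketch of how such rules could be applied to an already-computed category list. The function name `apply_smart_rules` and the order in which rule types are evaluated are assumptions, and `keep_primary_only` is treated here simply as an intersection, which may differ from the gem's actual behaviour.

```ruby
require 'uri'

# Hypothetical rule application (not the gem's implementation).
# `rules` uses the same shape as the `custom_rules` hash shown earlier.
def apply_smart_rules(url, categories, rules)
  uri = URI.parse(url.include?('://') ? url : "https://#{url}")
  host = uri.host.to_s.downcase.sub(/\Awww\./, '')
  path = uri.path.to_s

  rules.each_value.reduce(categories.dup) do |result, rule|
    # Rules only apply to the domains they name.
    next result unless Array(rule[:domains]).include?(host)

    result -= Array(rule[:remove_categories])
    # Assumption: both "keep" and "allow" rules act as whitelists.
    result &= rule[:keep_primary_only] if rule[:keep_primary_only]
    result &= rule[:allowed_categories_only] if rule[:allowed_categories_only]

    # Path-based additions run last so they can restore a removed category.
    (rule[:add_categories_by_path] || {}).each do |pattern, extra|
      result |= extra if pattern.match?(path)
    end
    result
  end
end
```

Running path-based additions after removals is what lets `/r/fitness` regain `:health_and_fitness` even though the same rule strips it from the bare domain.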
|
422
|
+
|
423
|
+
#### Smart Rules with IAB Compliance
|
424
|
+
|
425
|
+
Smart categorization works seamlessly with IAB compliance:
|
426
|
+
|
427
|
+
```ruby
|
428
|
+
client = UrlCategorise::Client.new(
|
429
|
+
smart_categorization: true,
|
430
|
+
iab_compliance: true,
|
431
|
+
iab_version: :v3
|
432
|
+
)
|
433
|
+
|
434
|
+
# Returns clean IAB codes after smart processing
|
435
|
+
categories = client.categorise("reddit.com") # => ["14"] (Society - Social Media)
|
436
|
+
```
|
437
|
+
|
438
|
+
## IAB Content Taxonomy Compliance
|
439
|
+
|
440
|
+
UrlCategorise supports IAB (Interactive Advertising Bureau) Content Taxonomy compliance for standardized content categorization:
|
441
|
+
|
442
|
+
### Basic IAB Compliance
|
443
|
+
|
444
|
+
```ruby
|
445
|
+
# Enable IAB v3.0 compliance (default)
|
446
|
+
client = UrlCategorise::Client.new(
|
447
|
+
iab_compliance: true,
|
448
|
+
iab_version: :v3
|
449
|
+
)
|
450
|
+
|
451
|
+
# Enable IAB v2.0 compliance
|
452
|
+
client = UrlCategorise::Client.new(
|
453
|
+
iab_compliance: true,
|
454
|
+
iab_version: :v2
|
455
|
+
)
|
456
|
+
|
457
|
+
# Categorization returns IAB codes instead of custom categories
|
458
|
+
categories = client.categorise("badsite.com")
|
459
|
+
puts categories # => ["626"] (IAB v3 code for illegal content)
|
460
|
+
|
461
|
+
# Check IAB compliance status
|
462
|
+
puts client.iab_compliant? # => true
|
463
|
+
|
464
|
+
# Get IAB mapping for a specific category
|
465
|
+
puts client.get_iab_mapping(:malware) # => "626" (v3) or "IAB25" (v2)
|
466
|
+
```
|
467
|
+
|
468
|
+
### IAB Category Mappings
|
469
|
+
|
470
|
+
The gem maps security and content categories to appropriate IAB codes:
|
471
|
+
|
472
|
+
**IAB Content Taxonomy v3.0 (recommended):**
|
473
|
+
- `malware`, `phishing`, `illegal` → `626` (Illegal Content)
|
474
|
+
- `advertising`, `mobile_ads` → `3` (Advertising)
|
475
|
+
- `gambling` → `7-39` (Gambling)
|
476
|
+
- `pornography` → `626` (Adult Content)
|
477
|
+
- `social_media` → `14` (Society)
|
478
|
+
- `technology` → `19` (Technology & Computing)
|
479
|
+
|
480
|
+
**IAB Content Taxonomy v2.0:**
|
481
|
+
- `malware`, `phishing` → `IAB25` (Non-Standard Content)
|
482
|
+
- `advertising` → `IAB3` (Advertising)
|
483
|
+
- `gambling` → `IAB7-39` (Gambling)
|
484
|
+
- `pornography` → `IAB25-3` (Pornography)
|
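The mappings listed above can be expressed as a simple lookup table. The sketch below copies only the codes shown in the two lists; the constant `IAB_MAPPINGS`, the method `map_to_iab`, and the choice to drop unmapped categories are assumptions for illustration, not the contents of `iab_compliance.rb`.

```ruby
# Codes copied from the README lists; names are hypothetical.
IAB_MAPPINGS = {
  v3: {
    malware: '626', phishing: '626', illegal: '626',
    advertising: '3', mobile_ads: '3',
    gambling: '7-39', pornography: '626',
    social_media: '14', technology: '19'
  },
  v2: {
    malware: 'IAB25', phishing: 'IAB25',
    advertising: 'IAB3', gambling: 'IAB7-39', pornography: 'IAB25-3'
  }
}.freeze

# Translates category symbols to IAB codes for the given taxonomy version.
# Assumption: categories without a mapping are dropped, and duplicates collapse.
def map_to_iab(categories, version: :v3)
  table = IAB_MAPPINGS.fetch(version)
  categories.filter_map { |category| table[category] }.uniq
end
```

Because several categories share one code (for example `malware` and `phishing` in v3), de-duplicating the result keeps the returned list short.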
485
|
+
|
486
|
+
### Integration with Datasets
|
487
|
+
|
488
|
+
IAB compliance works seamlessly with dataset processing:
|
489
|
+
|
490
|
+
```ruby
|
491
|
+
client = UrlCategorise::Client.new(
|
492
|
+
iab_compliance: true,
|
493
|
+
iab_version: :v3,
|
494
|
+
dataset_config: {
|
495
|
+
kaggle: { username: 'user', api_key: 'key' }
|
496
|
+
},
|
497
|
+
auto_load_datasets: true # Automatically load predefined datasets with IAB mapping
|
498
|
+
)
|
499
|
+
|
500
|
+
# Load additional datasets - categories will be mapped to IAB codes
|
501
|
+
client.load_kaggle_dataset('owner', 'dataset-name')
|
502
|
+
client.load_csv_dataset('https://example.com/data.csv')
|
503
|
+
|
504
|
+
# All categorization methods return IAB codes
|
505
|
+
categories = client.categorise("example.com") # => ["3", "626"]
|
506
|
+
```
|
507
|
+
|
135
508
|
## Available Categories
|
136
509
|
|
137
510
|
### Security & Threat Intelligence
|
@@ -194,9 +567,43 @@ ruby bin/check_lists
|
|
194
567
|
|
195
568
|
## Dataset Processing
|
196
569
|
|
197
|
-
UrlCategorise
|
570
|
+
UrlCategorise supports processing external datasets from Kaggle and CSV files to expand categorization data beyond traditional blocklists. This allows integration of machine learning datasets and custom URL classification data:
|
571
|
+
|
572
|
+
### Automatic Dataset Loading
|
573
|
+
|
574
|
+
Enable automatic loading of predefined datasets during client initialization:
|
198
575
|
|
199
|
-
|
576
|
+
```ruby
|
577
|
+
# Enable automatic dataset loading from constants
|
578
|
+
client = UrlCategorise::Client.new(
|
579
|
+
dataset_config: {
|
580
|
+
kaggle: {
|
581
|
+
username: ENV['KAGGLE_USERNAME'],
|
582
|
+
api_key: ENV['KAGGLE_API_KEY']
|
583
|
+
},
|
584
|
+
cache_path: './dataset_cache',
|
585
|
+
download_path: './downloads'
|
586
|
+
},
|
587
|
+
auto_load_datasets: true # Automatically loads all predefined datasets
|
588
|
+
)
|
589
|
+
|
590
|
+
# Datasets are now automatically integrated and ready for use
|
591
|
+
categories = client.categorise('https://example.com')
|
592
|
+
puts "Dataset categories loaded: #{client.count_of_dataset_categories}"
|
593
|
+
puts "Dataset hosts: #{client.count_of_dataset_hosts}"
|
594
|
+
```
|
595
|
+
|
596
|
+
The gem includes predefined high-quality datasets in constants:
|
597
|
+
- **`shaurov/website-classification-using-url`** - Comprehensive URL classification dataset
|
598
|
+
- **`hetulmehta/website-classification`** - Website categorization with cleaned text data
|
599
|
+
- **`shawon10/url-classification-dataset-dmoz`** - DMOZ-based URL classification
|
600
|
+
- **Data.world CSV dataset** - Additional URL categorization data
|
601
|
+
|
602
|
+
### Manual Dataset Loading
|
603
|
+
|
604
|
+
You can also load datasets manually for more control over the process:
|
605
|
+
|
606
|
+
#### Kaggle Dataset Integration
|
200
607
|
|
201
608
|
Load datasets directly from Kaggle using three authentication methods:
|
202
609
|
|
@@ -244,7 +651,7 @@ client.load_kaggle_dataset('owner', 'dataset-name', {
|
|
244
651
|
categories = client.categorise('https://example.com')
|
245
652
|
```
|
246
653
|
|
247
|
-
|
654
|
+
#### CSV Dataset Processing
|
248
655
|
|
249
656
|
Load datasets from direct CSV URLs:
|
250
657
|
|
@@ -286,7 +693,10 @@ dataset_config = {
|
|
286
693
|
timeout: 30 # HTTP timeout for downloads
|
287
694
|
}
|
288
695
|
|
289
|
-
client = UrlCategorise::Client.new(
|
696
|
+
client = UrlCategorise::Client.new(
|
697
|
+
dataset_config: dataset_config,
|
698
|
+
auto_load_datasets: true # Enable automatic loading of predefined datasets
|
699
|
+
)
|
290
700
|
```
|
291
701
|
|
292
702
|
### Disabling Kaggle Functionality
|
@@ -485,7 +895,18 @@ class UrlCategorizerService
|
|
485
895
|
cache_dir: Rails.root.join('tmp', 'url_cache'),
|
486
896
|
use_database: true,
|
487
897
|
force_download: Rails.env.development?,
|
488
|
-
request_timeout: Rails.env.production? ? 30 : 10 # Longer timeout in production
|
898
|
+
request_timeout: Rails.env.production? ? 30 : 10, # Longer timeout in production
|
899
|
+
iab_compliance: Rails.env.production?, # Enable IAB compliance in production
|
900
|
+
iab_version: :v3, # Use IAB Content Taxonomy v3.0
|
901
|
+
auto_load_datasets: Rails.env.production?, # Auto-load datasets in production
|
902
|
+
dataset_config: {
|
903
|
+
kaggle: {
|
904
|
+
username: ENV['KAGGLE_USERNAME'],
|
905
|
+
api_key: ENV['KAGGLE_API_KEY']
|
906
|
+
},
|
907
|
+
cache_path: Rails.root.join('tmp', 'dataset_cache'),
|
908
|
+
download_path: Rails.root.join('tmp', 'dataset_downloads')
|
909
|
+
}
|
489
910
|
)
|
490
911
|
end
|
491
912
|
|
@@ -508,12 +929,34 @@ class UrlCategorizerService
|
|
508
929
|
end
|
509
930
|
|
510
931
|
def stats
|
511
|
-
@client.database_stats
|
932
|
+
base_stats = @client.database_stats
|
933
|
+
base_stats.merge({
|
934
|
+
dataset_hosts: @client.count_of_dataset_hosts,
|
935
|
+
dataset_categories: @client.count_of_dataset_categories,
|
936
|
+
iab_compliant: @client.iab_compliant?,
|
937
|
+
iab_version: @client.iab_version
|
938
|
+
})
|
512
939
|
end
|
513
940
|
|
514
941
|
def refresh_lists!
|
515
942
|
@client.update_database
|
516
943
|
end
|
944
|
+
|
945
|
+
def load_dataset(type, identifier, options = {})
|
946
|
+
case type.to_s
|
947
|
+
when 'kaggle'
|
948
|
+
owner, dataset = identifier.split('/')
|
949
|
+
@client.load_kaggle_dataset(owner, dataset, options)
|
950
|
+
when 'csv'
|
951
|
+
@client.load_csv_dataset(identifier, options)
|
952
|
+
else
|
953
|
+
raise ArgumentError, "Unsupported dataset type: #{type}"
|
954
|
+
end
|
955
|
+
end
|
956
|
+
|
957
|
+
def get_iab_mapping(category)
|
958
|
+
@client.get_iab_mapping(category)
|
959
|
+
end
|
517
960
|
end
|
518
961
|
```
|
519
962
|
|
data/bin/export_csv
ADDED
@@ -0,0 +1,120 @@
|
|
1
|
+
#!/usr/bin/env ruby
|
2
|
+
|
3
|
+
require 'bundler/setup'
|
4
|
+
require 'optparse'
|
5
|
+
require_relative '../lib/url_categorise'
|
6
|
+
|
7
|
+
options = {
|
8
|
+
output_path: nil,
|
9
|
+
cache_dir: nil,
|
10
|
+
verbose: false,
|
11
|
+
iab_compliance: false,
|
12
|
+
smart_categorization: false,
|
13
|
+
auto_load_datasets: false,
|
14
|
+
kaggle_credentials_file: nil
|
15
|
+
}
|
16
|
+
|
17
|
+
OptionParser.new do |opts|
|
18
|
+
opts.banner = "Usage: #{$0} [options]"
|
19
|
+
opts.separator ""
|
20
|
+
opts.separator "Export all categorized domains and dataset content as a single CSV file for AI training"
|
21
|
+
opts.separator ""
|
22
|
+
|
23
|
+
opts.on("-o", "--output PATH", "Output directory path (default: cache_dir/exports/csv or ./exports/csv)") do |path|
|
24
|
+
options[:output_path] = path
|
25
|
+
end
|
26
|
+
|
27
|
+
opts.on("-c", "--cache-dir PATH", "Cache directory path for client initialization") do |path|
|
28
|
+
options[:cache_dir] = path
|
29
|
+
end
|
30
|
+
|
31
|
+
opts.on("--iab-compliance", "Enable IAB compliance for category mapping") do
|
32
|
+
options[:iab_compliance] = true
|
33
|
+
end
|
34
|
+
|
35
|
+
opts.on("--smart-categorization", "Enable smart categorization") do
|
36
|
+
options[:smart_categorization] = true
|
37
|
+
end
|
38
|
+
|
39
|
+
opts.on("--auto-load-datasets", "Auto-load datasets from constants for rich content export") do
|
40
|
+
options[:auto_load_datasets] = true
|
41
|
+
end
|
42
|
+
|
43
|
+
opts.on("--kaggle-credentials FILE", "Path to Kaggle credentials file (default: ~/.kaggle/kaggle.json)") do |file|
|
44
|
+
options[:kaggle_credentials_file] = file
|
45
|
+
end
|
46
|
+
|
47
|
+
opts.on("-v", "--verbose", "Verbose output") do
|
48
|
+
options[:verbose] = true
|
49
|
+
end
|
50
|
+
|
51
|
+
opts.on("-h", "--help", "Show this help message") do
|
52
|
+
puts opts
|
53
|
+
exit
|
54
|
+
end
|
55
|
+
end.parse!
|
56
|
+
|
57
|
+
puts "=== UrlCategorise CSV Data Export ===" if options[:verbose]
|
58
|
+
puts "Initializing client..." if options[:verbose]
|
59
|
+
|
60
|
+
begin
|
61
|
+
# Build dataset config if datasets should be loaded
|
62
|
+
dataset_config = {}
|
63
|
+
if options[:auto_load_datasets]
|
64
|
+
dataset_config = {
|
65
|
+
cache_path: options[:cache_dir] ? File.join(options[:cache_dir], 'datasets') : './url_cache/datasets',
|
66
|
+
download_path: options[:cache_dir] ? File.join(options[:cache_dir], 'downloads') : './url_cache/downloads'
|
67
|
+
}
|
68
|
+
|
69
|
+
# Add Kaggle credentials if provided
|
70
|
+
if options[:kaggle_credentials_file]
|
71
|
+
dataset_config[:kaggle] = { credentials_file: options[:kaggle_credentials_file] }
|
72
|
+
elsif File.exist?(File.expand_path('~/.kaggle/kaggle.json'))
|
73
|
+
dataset_config[:kaggle] = { credentials_file: '~/.kaggle/kaggle.json' }
|
74
|
+
end
|
75
|
+
end
|
76
|
+
|
77
|
+
client = UrlCategorise::Client.new(
|
78
|
+
cache_dir: options[:cache_dir],
|
79
|
+
iab_compliance: options[:iab_compliance],
|
80
|
+
smart_categorization: options[:smart_categorization],
|
81
|
+
auto_load_datasets: options[:auto_load_datasets],
|
82
|
+
dataset_config: dataset_config
|
83
|
+
)
|
84
|
+
|
85
|
+
if options[:verbose] && options[:auto_load_datasets]
|
86
|
+
puts "Client initialized with dataset loading enabled"
|
87
|
+
puts "Dataset statistics:"
|
88
|
+
puts " Dataset categories: #{client.count_of_dataset_categories}"
|
89
|
+
puts "  Dataset hosts: #{client.count_of_dataset_hosts.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse}"
|
90
|
+
puts " Dataset data size: #{client.size_of_dataset_data.round(2)} MB" if client.respond_to?(:size_of_dataset_data)
|
91
|
+
end
|
92
|
+
|
93
|
+
puts "Exporting CSV data..." if options[:verbose]
|
94
|
+
|
95
|
+
result = client.export_csv_data(options[:output_path])
|
96
|
+
|
97
|
+
puts "\n✅ Export completed successfully!"
|
98
|
+
puts "Export directory: #{result[:export_directory]}"
|
99
|
+
puts "CSV file: #{result[:csv_file]}"
|
100
|
+
puts "Summary file: #{result[:summary_file]}"
|
101
|
+
|
102
|
+
puts "\nData Summary:"
|
103
|
+
puts " Total entries: #{result[:total_entries]}"
|
104
|
+
puts " Domain categorizations: #{result[:summary][:domain_categorization_entries]}"
|
105
|
+
puts " Dataset content entries: #{result[:summary][:dataset_content_entries]}"
|
106
|
+
puts " Total categories: #{result[:summary][:total_categories]}"
|
107
|
+
puts " Has dataset content: #{result[:summary][:has_dataset_content]}"
|
108
|
+
|
109
|
+
if options[:verbose]
|
110
|
+
puts "\n🏷️ Categories included:"
|
111
|
+
result[:summary][:categories].each do |category|
|
112
|
+
puts " - #{category}"
|
113
|
+
end
|
114
|
+
end
|
115
|
+
|
116
|
+
rescue StandardError => e
|
117
|
+
puts "Error: #{e.message}"
|
118
|
+
puts e.backtrace if options[:verbose]
|
119
|
+
exit 1
|
120
|
+
end
|