UrlCategorise 0.1.3 → 0.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -5,6 +5,10 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
5
5
  ## Features
6
6
 
7
7
  - **Comprehensive Coverage**: 60+ high-quality categories including security, content, and specialized lists
8
+ - **Video Content Detection**: Advanced regex-based categorization with `video_url?` method to distinguish video content from other website resources
9
+ - **Custom Video Lists**: Generate and maintain comprehensive video hosting domain lists using yt-dlp extractors
10
+ - **Kaggle Dataset Integration**: Automatic loading and processing of machine learning datasets from Kaggle
11
+ - **Multiple Data Sources**: Supports blocklists, CSV datasets, and Kaggle ML datasets
8
12
  - **Multiple List Formats**: Supports hosts files, pfSense, AdSense, uBlock Origin, dnsmasq, and plain text formats
9
13
  - **Intelligent Caching**: Hash-based file update detection with configurable local cache
10
14
  - **DNS Resolution**: Resolve domains to IPs and check against IP-based blocklists
@@ -14,6 +18,10 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
14
18
  - **Metadata Tracking**: Track last update times, ETags, and content hashes
15
19
  - **Health Monitoring**: Automatic detection and removal of broken blocklist sources
16
20
  - **List Validation**: Built-in tools to verify all configured URLs are accessible
21
+ - **Auto-Loading Datasets**: Automatic processing of predefined datasets during client initialization
22
+ - **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
23
+ - **Data Export**: Export categorized data as hosts files per category or comprehensive CSV exports
24
+ - **CLI Commands**: Command-line utilities for data export and list checking
17
25
 
18
26
  ## Installation
19
27
 
@@ -44,6 +52,15 @@ puts "Total hosts: #{client.count_of_hosts}"
44
52
  puts "Categories: #{client.count_of_categories}"
45
53
  puts "Data size: #{client.size_of_data} MB"
46
54
 
55
+ # Get detailed size breakdown
56
+ puts "Total data size: #{client.size_of_data} MB (#{client.size_of_data_bytes} bytes)"
57
+ puts "Blocklist data size: #{client.size_of_blocklist_data} MB (#{client.size_of_blocklist_data_bytes} bytes)"
58
+ puts "Dataset data size: #{client.size_of_dataset_data} MB (#{client.size_of_dataset_data_bytes} bytes)"
59
+
60
+ # Get dataset-specific statistics (if datasets are loaded)
61
+ puts "Dataset hosts: #{client.count_of_dataset_hosts}"
62
+ puts "Dataset categories: #{client.count_of_dataset_categories}"
63
+
47
64
  # Categorize a URL or domain
48
65
  categories = client.categorise("badsite.com")
49
66
  puts "Categories: #{categories}" # => [:malware, :phishing]
@@ -57,6 +74,116 @@ ip_categories = client.categorise_ip("192.168.1.100")
57
74
  puts "IP categories: #{ip_categories}"
58
75
  ```
59
76
 
77
+ ## New Features
78
+
79
+ ### Dynamic Settings with ActiveAttr
80
+
81
+ The Client class now supports in-memory modification of settings using ActiveAttr:
82
+
83
+ ```ruby
84
+ client = UrlCategorise::Client.new
85
+
86
+ # Modify settings dynamically
87
+ client.smart_categorization_enabled = true
88
+ client.iab_compliance_enabled = true
89
+ client.iab_version = :v2
90
+ client.request_timeout = 30
91
+ client.dns_servers = ['8.8.8.8', '8.8.4.4']
92
+
93
+ # Settings take effect immediately - no need to recreate the client
94
+ categories = client.categorise('reddit.com') # Uses new smart categorization rules
95
+ ```
96
+
97
+ ### Data Export Features
98
+
99
+ #### Hosts File Export
100
+
101
+ Export all categorized domains as separate hosts files per category:
102
+
103
+ ```ruby
104
+ # Export to default location
105
+ result = client.export_hosts_files
106
+
107
+ # Export to custom location
108
+ result = client.export_hosts_files('/custom/export/path')
109
+
110
+ # Result includes file information and summary
111
+ puts "Exported #{result[:_summary][:total_categories]} categories"
112
+ puts "Total domains: #{result[:_summary][:total_domains]}"
113
+ puts "Files saved to: #{result[:_summary][:export_directory]}"
114
+ ```
115
+
116
+ Each category gets its own hosts file (e.g., `malware.hosts`, `advertising.hosts`) with proper headers and sorted domains.
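
The per-category layout described in the added README text above can be illustrated with a small standalone sketch. It does not call the gem: `sketch_export_hosts`, the two header lines, and the `0.0.0.0` sink address are assumptions made for illustration, and the gem's actual file contents may differ.

```ruby
require 'tmpdir'

# Hypothetical helper, not part of UrlCategorise: writes one hosts file per
# category with a short header and de-duplicated, sorted domains.
def sketch_export_hosts(hosts_by_category, dir)
  hosts_by_category.each_with_object({}) do |(category, domains), written|
    sorted = domains.uniq.sort
    lines = ["# Category: #{category}", "# Domains: #{sorted.size}"]
    lines.concat(sorted.map { |domain| "0.0.0.0 #{domain}" }) # assumed sink IP
    path = File.join(dir, "#{category}.hosts")
    File.write(path, lines.join("\n") + "\n")
    written[category] = path
  end
end

# Usage: export into a temporary directory and print one generated file.
Dir.mktmpdir do |dir|
  files = sketch_export_hosts({ malware: ['b.example', 'a.example', 'a.example'] }, dir)
  puts File.read(files[:malware])
end
```

The returned hash maps each category to the path that was written, loosely mirroring the per-category result the README describes.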
117
+
118
+ #### CSV Data Export
119
+
120
+ Export all data as a single comprehensive CSV file for AI training and analysis:
121
+
122
+ ```ruby
123
+ # Export to default location
124
+ result = client.export_csv_data
125
+
126
+ # Export to custom location with IAB compliance
127
+ client.iab_compliance_enabled = true
128
+ result = client.export_csv_data('/custom/export/path')
129
+
130
+ # Returns information about created files:
131
+ # {
132
+ # csv_file: '/path/url_categorise_comprehensive_export_20231201_143022.csv',
133
+ # summary_file: '/path/export_summary_20231201_143022.json',
134
+ # total_entries: 50000,
135
+ # summary: { ... },
136
+ # export_directory: '/path'
137
+ # }
138
+ ```
139
+
140
+ **Single comprehensive CSV file contains:**
141
+
142
+ - **Domain Categorization Data**: All processed domains with categories, source types, IAB mappings
143
+ - **Raw Dataset Content**: Original dataset entries with titles, descriptions, text, summaries, and all available fields
144
+ - **Dynamic Headers**: Automatically adapts to include all available data fields
145
+ - **Data Type Column**: Distinguishes between 'domain_categorization', 'raw_dataset_content', etc.
146
+
147
+ **Key Features:**
148
+ - Everything in one file for easy analysis and AI/ML training
149
+ - Rich textual content from original datasets
150
+ - IAB Content Taxonomy compliance mapping
151
+ - Smart categorization metadata
152
+ - Source type tracking (dataset vs blocklist)
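
The "Dynamic Headers" and "Data Type Column" points in the README text above can be sketched as follows. This is a minimal illustration under stated assumptions, not the gem's exporter: `sketch_rows_to_csv` and the sample row fields are hypothetical, and only the `data_type` values are taken from the README.

```ruby
require 'csv'

# Hypothetical helper: builds the header row from the union of all keys,
# so rows with different fields can share one CSV file.
def sketch_rows_to_csv(rows)
  headers = rows.flat_map(&:keys).uniq
  CSV.generate do |csv|
    csv << headers.map(&:to_s)
    # Fields a row does not have are written as empty cells.
    rows.each { |row| csv << headers.map { |header| row[header] } }
  end
end

rows = [
  { data_type: 'domain_categorization', domain: 'example.com', category: 'advertising' },
  { data_type: 'raw_dataset_content', url: 'https://example.org', title: 'Example' }
]
puts sketch_rows_to_csv(rows)
```

Taking the union of keys keeps every field available to downstream analysis, at the cost of sparse columns for rows that lack them.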
153
+
154
+ #### CLI Commands
155
+
156
+ Command-line utilities for data export:
157
+
158
+ ```bash
159
+ # Export hosts files
160
+ $ bundle exec export_hosts --output /tmp/hosts --verbose
161
+
162
+ # Export CSV data with all features enabled
163
+ $ bundle exec export_csv --output /tmp/csv --iab-compliance --smart-categorization --auto-load-datasets --verbose
164
+
165
+ # Generate updated video hosting lists
166
+ $ ruby bin/generate_video_lists
167
+
168
+ # Check health of all blocklist URLs
169
+ $ bundle exec check_lists
170
+
171
+ # Export with custom Kaggle credentials
172
+ $ bundle exec export_csv --auto-load-datasets --kaggle-credentials ~/my-kaggle.json --verbose
173
+
174
+ # Basic export (domains only)
175
+ $ bundle exec export_csv --output /tmp/csv
176
+
177
+ # Check URL health (existing command)
178
+ $ bundle exec check_lists
179
+ ```
180
+
181
+ **Key CLI Options:**
182
+ - `--auto-load-datasets`: Load datasets from constants to include rich text content
183
+ - `--kaggle-credentials FILE`: Specify custom Kaggle credentials file
184
+ - `--iab-compliance`: Enable IAB Content Taxonomy mapping
185
+ - `--smart-categorization`: Enable intelligent category filtering
186
+
60
187
  ## Advanced Configuration
61
188
 
62
189
  ### File Caching
@@ -113,7 +240,11 @@ client = UrlCategorise::Client.new(
113
240
  cache_dir: "./url_cache", # Enable local caching
114
241
  force_download: false, # Use cache when available
115
242
  dns_servers: ['1.1.1.1', '1.0.0.1'], # Cloudflare DNS servers
116
- request_timeout: 15 # 15 second HTTP timeout
243
+ request_timeout: 15, # 15 second HTTP timeout
244
+ iab_compliance: true, # Enable IAB compliance
245
+ iab_version: :v3, # Use IAB Content Taxonomy v3.0
246
+ auto_load_datasets: false, # Disable automatic dataset loading (default)
247
+ smart_categorization: false # Disable smart post-processing (default)
117
248
  )
118
249
  ```
119
250
 
@@ -132,6 +263,248 @@ host_urls = {
132
263
  client = UrlCategorise::Client.new(host_urls: host_urls)
133
264
  ```
134
265
 
266
+ ### Video Content Detection
267
+
268
+ The gem includes advanced regex-based categorization specifically for video hosting platforms. This helps distinguish between actual video content URLs and other resources like homepages, user profiles, playlists, or community content.
269
+
270
+ #### Video Hosting Domains
271
+
272
+ The gem maintains a comprehensive list of video hosting domains extracted from yt-dlp (YouTube-dl fork) extractors:
273
+
274
+ ```ruby
275
+ # Generate/update video hosting lists
276
+ system("ruby bin/generate_video_lists")
277
+
278
+ # Use video hosting categorization
279
+ client = UrlCategorise::Client.new
280
+ categories = client.categorise("youtube.com")
281
+ # => [:video_hosting]
282
+ ```
283
+
284
+ #### Video Content vs Other Resources
285
+
286
+ Enable regex categorization to distinguish video content from other resources:
287
+
288
+ ```ruby
289
+ client = UrlCategorise::Client.new(
290
+ regex_categorization: true # Uses remote video patterns by default
291
+ )
292
+
293
+ # Regular homepage gets basic category
294
+ client.categorise("https://youtube.com")
295
+ # => [:video_hosting]
296
+
297
+ # Actual video URL gets enhanced categorization
298
+ client.categorise("https://youtube.com/watch?v=dQw4w9WgXcQ")
299
+ # => [:video_hosting, :video_hosting_content]
300
+
301
+ # User profile page - no content enhancement
302
+ client.categorise("https://youtube.com/@username")
303
+ # => [:video_hosting]
304
+ ```
305
+
306
+ #### Direct Video URL Detection
307
+
308
+ Use the `video_url?` method to check if a URL is a direct link to video content:
309
+
310
+ ```ruby
311
+ client = UrlCategorise::Client.new(regex_categorization: true)
312
+
313
+ # Check if URLs are direct video content links
314
+ client.video_url?("https://youtube.com/watch?v=dQw4w9WgXcQ") # => true
315
+ client.video_url?("https://youtube.com") # => false
316
+ client.video_url?("https://youtube.com/@channel") # => false
317
+ client.video_url?("https://vimeo.com/123456789") # => true
318
+ client.video_url?("https://tiktok.com/@user/video/123") # => true
319
+
320
+ # Works with various video hosting platforms
321
+ client.video_url?("https://dailymotion.com/video/x7abc123") # => true
322
+ client.video_url?("https://twitch.tv/videos/1234567890") # => true
323
+
324
+ # Returns false for non-video domains
325
+ client.video_url?("https://google.com/search?q=cats") # => false
326
+ ```
327
+
328
+ **How it works:**
329
+ 1. First checks if the URL is from a known video hosting domain
330
+ 2. Then uses regex patterns to determine if it's a direct video content URL
331
+ 3. Returns `true` only if both conditions are met
332
+ 4. Handles invalid URLs gracefully (returns `false`)
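
The four steps listed above can be sketched in standalone Ruby. This is not the gem's implementation: `sketch_video_url?`, the two-entry domain set, and both regex patterns are hypothetical stand-ins for the generated lists the README mentions.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-ins for the gem's video hosting list and URL patterns.
SKETCH_VIDEO_DOMAINS = Set['youtube.com', 'vimeo.com'].freeze
SKETCH_VIDEO_PATTERNS = [
  %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=[\w-]+},
  %r{\Ahttps?://(?:www\.)?vimeo\.com/\d+}
].freeze

def sketch_video_url?(url)
  host = URI.parse(url).host
  return false if host.nil?

  # Step 1: the host must be a known video hosting domain.
  domain = host.downcase.sub(/\Awww\./, '')
  return false unless SKETCH_VIDEO_DOMAINS.include?(domain)

  # Steps 2 and 3: true only if the full URL also matches a content pattern.
  SKETCH_VIDEO_PATTERNS.any? { |pattern| pattern.match?(url) }
rescue URI::InvalidURIError
  # Step 4: unparseable input is treated as "not a video URL".
  false
end

puts sketch_video_url?('https://youtube.com/watch?v=dQw4w9WgXcQ')
puts sketch_video_url?('https://youtube.com/@channel')
```

Checking the domain first keeps the regex pass cheap, since most URLs are rejected before any pattern is evaluated.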
333
+
334
+ #### Maintaining Video Lists
335
+
336
+ The gem includes a script to generate and maintain comprehensive video hosting lists:
337
+
338
+ ```bash
339
+ # Generate updated video hosting lists
340
+ ruby bin/generate_video_lists
341
+
342
+ # This creates:
343
+ # - lists/video_hosting_domains.hosts (PiHole compatible)
344
+ # - lists/video_url_patterns.txt (Regex patterns for content detection)
345
+ ```
346
+
347
+ The script fetches data from yt-dlp extractors and combines it with manually curated major platforms to ensure comprehensive coverage.
348
+
349
+ ### Smart Categorization (Post-Processing)
350
+
351
+ Smart categorization solves the problem of overly broad domain-level categorization. For example, `reddit.com` might appear in health & fitness blocklists, but not all Reddit content is health-related.
352
+
353
+ #### The Problem
354
+
355
+ ```ruby
356
+ # Without smart categorization
357
+ client.categorise("reddit.com")
358
+ # => [:reddit, :social_media, :health_and_fitness, :forums] # Too broad!
359
+
360
+ client.categorise("reddit.com/r/technology")
361
+ # => [:reddit, :social_media, :health_and_fitness, :forums] # Still wrong!
362
+ ```
363
+
364
+ #### The Solution
365
+
366
+ ```ruby
367
+ # Enable smart categorization
368
+ client = UrlCategorise::Client.new(
369
+ smart_categorization: true # Remove overly broad categories
370
+ )
371
+
372
+ client.categorise("reddit.com")
373
+ # => [:reddit, :social_media] # Much more accurate!
374
+ ```
375
+
376
+ #### How It Works
377
+
378
+ Smart categorization automatically removes overly broad categories for known platforms:
379
+
380
+ - **Social Media Platforms** (Reddit, Facebook, Twitter, etc.): Removes categories like `:health_and_fitness`, `:forums`, `:news`, `:technology`, `:education`
381
+ - **Search Engines** (Google, Bing, etc.): Removes categories like `:news`, `:shopping`, `:travel`
382
+ - **Video Platforms** (YouTube, Vimeo, etc.): Removes categories like `:education`, `:entertainment`, `:music`
383
+
384
+ #### Custom Smart Rules
385
+
386
+ You can define custom rules for specific domains or URL patterns:
387
+
388
+ ```ruby
389
+ custom_rules = {
390
+ reddit_subreddits: {
391
+ domains: ['reddit.com'],
392
+ remove_categories: [:health_and_fitness, :forums],
393
+ add_categories_by_path: {
394
+ /\/r\/fitness/ => [:health_and_fitness], # Add back for /r/fitness
395
+ /\/r\/technology/ => [:technology], # Add technology for /r/technology
396
+ /\/r\/programming/ => [:technology, :programming]
397
+ }
398
+ },
399
+ my_company_domains: {
400
+ domains: ['mycompany.com'],
401
+ allowed_categories_only: [:business, :technology] # Only allow specific categories
402
+ }
403
+ }
404
+
405
+ client = UrlCategorise::Client.new(
406
+ smart_categorization: true,
407
+ smart_rules: custom_rules
408
+ )
409
+
410
+ # Now path-based categorization works
411
+ client.categorise('reddit.com') # => [:reddit, :social_media]
412
+ client.categorise('reddit.com/r/fitness') # => [:reddit, :social_media, :health_and_fitness]
413
+ client.categorise('reddit.com/r/technology') # => [:reddit, :social_media, :technology]
414
+ ```
415
+
416
+ #### Available Rule Types
417
+
418
+ - **`remove_categories`**: Remove specific categories for domains
419
+ - **`keep_primary_only`**: Keep only specified categories, remove others
420
+ - **`allowed_categories_only`**: Only allow specific categories, block all others
421
+ - **`add_categories_by_path`**: Add categories based on URL path patterns
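
One way to read these rule types is as an ordered filter over the raw category list. The sketch below is an assumption about how such rules could be applied, not the gem's code: `apply_smart_rules` is a hypothetical helper, the rule keys are copied from the README, and `keep_primary_only` is left out because the README does not show its shape.

```ruby
require 'uri'

# Hypothetical helper: applies README-style smart rules to a category list.
def apply_smart_rules(url, categories, rules)
  uri = URI.parse(url.include?('://') ? url : "https://#{url}")
  host = uri.host.to_s.downcase.sub(/\Awww\./, '')
  result = categories.dup

  rules.each_value do |rule|
    next unless Array(rule[:domains]).include?(host)

    # remove_categories: drop the listed categories for this domain.
    result -= Array(rule[:remove_categories])
    # allowed_categories_only: keep nothing outside the allow-list.
    result &= rule[:allowed_categories_only] if rule[:allowed_categories_only]
    # add_categories_by_path: add categories back when the path matches.
    (rule[:add_categories_by_path] || {}).each do |pattern, extra|
      result |= extra if pattern.match?(uri.path.to_s)
    end
  end
  result
end

rules = {
  reddit_subreddits: {
    domains: ['reddit.com'],
    remove_categories: [:health_and_fitness, :forums],
    add_categories_by_path: { %r{/r/fitness} => [:health_and_fitness] }
  }
}
raw = [:reddit, :social_media, :health_and_fitness, :forums]
p apply_smart_rules('reddit.com', raw, rules)
p apply_smart_rules('reddit.com/r/fitness', raw, rules)
```

Applying removals before path-based additions is a design choice of this sketch; it lets a path rule restore a category that the domain-level rule removed, as in the README's `/r/fitness` example.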
422
+
423
+ #### Smart Rules with IAB Compliance
424
+
425
+ Smart categorization works seamlessly with IAB compliance:
426
+
427
+ ```ruby
428
+ client = UrlCategorise::Client.new(
429
+ smart_categorization: true,
430
+ iab_compliance: true,
431
+ iab_version: :v3
432
+ )
433
+
434
+ # Returns clean IAB codes after smart processing
435
+ categories = client.categorise("reddit.com") # => ["14"] (Society - Social Media)
436
+ ```
437
+
438
+ ## IAB Content Taxonomy Compliance
439
+
440
+ UrlCategorise supports IAB (Interactive Advertising Bureau) Content Taxonomy compliance for standardized content categorization:
441
+
442
+ ### Basic IAB Compliance
443
+
444
+ ```ruby
445
+ # Enable IAB v3.0 compliance (default)
446
+ client = UrlCategorise::Client.new(
447
+ iab_compliance: true,
448
+ iab_version: :v3
449
+ )
450
+
451
+ # Enable IAB v2.0 compliance
452
+ client = UrlCategorise::Client.new(
453
+ iab_compliance: true,
454
+ iab_version: :v2
455
+ )
456
+
457
+ # Categorization returns IAB codes instead of custom categories
458
+ categories = client.categorise("badsite.com")
459
+ puts categories # => ["626"] (IAB v3 code for illegal content)
460
+
461
+ # Check IAB compliance status
462
+ puts client.iab_compliant? # => true
463
+
464
+ # Get IAB mapping for a specific category
465
+ puts client.get_iab_mapping(:malware) # => "626" (v3) or "IAB25" (v2)
466
+ ```
467
+
468
+ ### IAB Category Mappings
469
+
470
+ The gem maps security and content categories to appropriate IAB codes:
471
+
472
+ **IAB Content Taxonomy v3.0 (recommended):**
473
+ - `malware`, `phishing`, `illegal` → `626` (Illegal Content)
474
+ - `advertising`, `mobile_ads` → `3` (Advertising)
475
+ - `gambling` → `7-39` (Gambling)
476
+ - `pornography` → `626` (Adult Content)
477
+ - `social_media` → `14` (Society)
478
+ - `technology` → `19` (Technology & Computing)
479
+
480
+ **IAB Content Taxonomy v2.0:**
481
+ - `malware`, `phishing` → `IAB25` (Non-Standard Content)
482
+ - `advertising` → `IAB3` (Advertising)
483
+ - `gambling` → `IAB7-39` (Gambling)
484
+ - `pornography` → `IAB25-3` (Pornography)
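
The mapping tables above amount to a per-version lookup. A minimal sketch follows, using only codes quoted in the README text. `IAB_SKETCH` and `sketch_iab_codes` are hypothetical names, and dropping unmapped categories is an assumption, since the README does not say how the gem treats them.

```ruby
# Hypothetical lookup tables built from the codes listed in the README.
IAB_SKETCH = {
  v3: { malware: '626', phishing: '626', advertising: '3',
        gambling: '7-39', social_media: '14', technology: '19' },
  v2: { malware: 'IAB25', phishing: 'IAB25', advertising: 'IAB3',
        gambling: 'IAB7-39', pornography: 'IAB25-3' }
}.freeze

# Maps internal category symbols to IAB codes for the chosen version.
# Unmapped categories are skipped here (an assumption of this sketch).
def sketch_iab_codes(categories, version: :v3)
  table = IAB_SKETCH.fetch(version)
  categories.filter_map { |category| table[category] }.uniq
end

p sketch_iab_codes([:malware, :phishing, :advertising])
p sketch_iab_codes([:malware, :gambling], version: :v2)
```

Because several internal categories share one IAB code, the result is de-duplicated so a domain flagged as both malware and phishing yields a single code.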
485
+
486
+ ### Integration with Datasets
487
+
488
+ IAB compliance works seamlessly with dataset processing:
489
+
490
+ ```ruby
491
+ client = UrlCategorise::Client.new(
492
+ iab_compliance: true,
493
+ iab_version: :v3,
494
+ dataset_config: {
495
+ kaggle: { username: 'user', api_key: 'key' }
496
+ },
497
+ auto_load_datasets: true # Automatically load predefined datasets with IAB mapping
498
+ )
499
+
500
+ # Load additional datasets - categories will be mapped to IAB codes
501
+ client.load_kaggle_dataset('owner', 'dataset-name')
502
+ client.load_csv_dataset('https://example.com/data.csv')
503
+
504
+ # All categorization methods return IAB codes
505
+ categories = client.categorise("example.com") # => ["3", "626"]
506
+ ```
507
+
135
508
  ## Available Categories
136
509
 
137
510
  ### Security & Threat Intelligence
@@ -194,9 +567,43 @@ ruby bin/check_lists
194
567
 
195
568
  ## Dataset Processing
196
569
 
197
- UrlCategorise now supports processing external datasets from Kaggle and CSV files to expand categorization data:
570
+ UrlCategorise supports processing external datasets from Kaggle and CSV files to expand categorization data beyond traditional blocklists. This allows integration of machine learning datasets and custom URL classification data:
571
+
572
+ ### Automatic Dataset Loading
573
+
574
+ Enable automatic loading of predefined datasets during client initialization:
198
575
 
199
- ### Kaggle Dataset Integration
576
+ ```ruby
577
+ # Enable automatic dataset loading from constants
578
+ client = UrlCategorise::Client.new(
579
+ dataset_config: {
580
+ kaggle: {
581
+ username: ENV['KAGGLE_USERNAME'],
582
+ api_key: ENV['KAGGLE_API_KEY']
583
+ },
584
+ cache_path: './dataset_cache',
585
+ download_path: './downloads'
586
+ },
587
+ auto_load_datasets: true # Automatically loads all predefined datasets
588
+ )
589
+
590
+ # Datasets are now automatically integrated and ready for use
591
+ categories = client.categorise('https://example.com')
592
+ puts "Dataset categories loaded: #{client.count_of_dataset_categories}"
593
+ puts "Dataset hosts: #{client.count_of_dataset_hosts}"
594
+ ```
595
+
596
+ The gem includes predefined high-quality datasets in constants:
597
+ - **`shaurov/website-classification-using-url`** - Comprehensive URL classification dataset
598
+ - **`hetulmehta/website-classification`** - Website categorization with cleaned text data
599
+ - **`shawon10/url-classification-dataset-dmoz`** - DMOZ-based URL classification
600
+ - **Data.world CSV dataset** - Additional URL categorization data
601
+
602
+ ### Manual Dataset Loading
603
+
604
+ You can also load datasets manually for more control over the process:
605
+
606
+ #### Kaggle Dataset Integration
200
607
 
201
608
  Load datasets directly from Kaggle using three authentication methods:
202
609
 
@@ -244,7 +651,7 @@ client.load_kaggle_dataset('owner', 'dataset-name', {
244
651
  categories = client.categorise('https://example.com')
245
652
  ```
246
653
 
247
- ### CSV Dataset Processing
654
+ #### CSV Dataset Processing
248
655
 
249
656
  Load datasets from direct CSV URLs:
250
657
 
@@ -286,7 +693,10 @@ dataset_config = {
286
693
  timeout: 30 # HTTP timeout for downloads
287
694
  }
288
695
 
289
- client = UrlCategorise::Client.new(dataset_config: dataset_config)
696
+ client = UrlCategorise::Client.new(
697
+ dataset_config: dataset_config,
698
+ auto_load_datasets: true # Enable automatic loading of predefined datasets
699
+ )
290
700
  ```
291
701
 
292
702
  ### Disabling Kaggle Functionality
@@ -485,7 +895,18 @@ class UrlCategorizerService
485
895
  cache_dir: Rails.root.join('tmp', 'url_cache'),
486
896
  use_database: true,
487
897
  force_download: Rails.env.development?,
488
- request_timeout: Rails.env.production? ? 30 : 10 # Longer timeout in production
898
+ request_timeout: Rails.env.production? ? 30 : 10, # Longer timeout in production
899
+ iab_compliance: Rails.env.production?, # Enable IAB compliance in production
900
+ iab_version: :v3, # Use IAB Content Taxonomy v3.0
901
+ auto_load_datasets: Rails.env.production?, # Auto-load datasets in production
902
+ dataset_config: {
903
+ kaggle: {
904
+ username: ENV['KAGGLE_USERNAME'],
905
+ api_key: ENV['KAGGLE_API_KEY']
906
+ },
907
+ cache_path: Rails.root.join('tmp', 'dataset_cache'),
908
+ download_path: Rails.root.join('tmp', 'dataset_downloads')
909
+ }
489
910
  )
490
911
  end
491
912
 
@@ -508,12 +929,34 @@ class UrlCategorizerService
508
929
  end
509
930
 
510
931
  def stats
511
- @client.database_stats
932
+ base_stats = @client.database_stats
933
+ base_stats.merge({
934
+ dataset_hosts: @client.count_of_dataset_hosts,
935
+ dataset_categories: @client.count_of_dataset_categories,
936
+ iab_compliant: @client.iab_compliant?,
937
+ iab_version: @client.iab_version
938
+ })
512
939
  end
513
940
 
514
941
  def refresh_lists!
515
942
  @client.update_database
516
943
  end
944
+
945
+ def load_dataset(type, identifier, options = {})
946
+ case type.to_s
947
+ when 'kaggle'
948
+ owner, dataset = identifier.split('/')
949
+ @client.load_kaggle_dataset(owner, dataset, options)
950
+ when 'csv'
951
+ @client.load_csv_dataset(identifier, options)
952
+ else
953
+ raise ArgumentError, "Unsupported dataset type: #{type}"
954
+ end
955
+ end
956
+
957
+ def get_iab_mapping(category)
958
+ @client.get_iab_mapping(category)
959
+ end
517
960
  end
518
961
  ```
519
962
 
data/bin/export_csv ADDED
@@ -0,0 +1,120 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'bundler/setup'
4
+ require 'optparse'
5
+ require_relative '../lib/url_categorise'
6
+
7
+ options = {
8
+ output_path: nil,
9
+ cache_dir: nil,
10
+ verbose: false,
11
+ iab_compliance: false,
12
+ smart_categorization: false,
13
+ auto_load_datasets: false,
14
+ kaggle_credentials_file: nil
15
+ }
16
+
17
+ OptionParser.new do |opts|
18
+ opts.banner = "Usage: #{$0} [options]"
19
+ opts.separator ""
20
+ opts.separator "Export all categorized domains and dataset content as a single CSV file for AI training"
21
+ opts.separator ""
22
+
23
+ opts.on("-o", "--output PATH", "Output directory path (default: cache_dir/exports/csv or ./exports/csv)") do |path|
24
+ options[:output_path] = path
25
+ end
26
+
27
+ opts.on("-c", "--cache-dir PATH", "Cache directory path for client initialization") do |path|
28
+ options[:cache_dir] = path
29
+ end
30
+
31
+ opts.on("--iab-compliance", "Enable IAB compliance for category mapping") do
32
+ options[:iab_compliance] = true
33
+ end
34
+
35
+ opts.on("--smart-categorization", "Enable smart categorization") do
36
+ options[:smart_categorization] = true
37
+ end
38
+
39
+ opts.on("--auto-load-datasets", "Auto-load datasets from constants for rich content export") do
40
+ options[:auto_load_datasets] = true
41
+ end
42
+
43
+ opts.on("--kaggle-credentials FILE", "Path to Kaggle credentials file (default: ~/.kaggle/kaggle.json)") do |file|
44
+ options[:kaggle_credentials_file] = file
45
+ end
46
+
47
+ opts.on("-v", "--verbose", "Verbose output") do
48
+ options[:verbose] = true
49
+ end
50
+
51
+ opts.on("-h", "--help", "Show this help message") do
52
+ puts opts
53
+ exit
54
+ end
55
+ end.parse!
56
+
57
+ puts "=== UrlCategorise CSV Data Export ===" if options[:verbose]
58
+ puts "Initializing client..." if options[:verbose]
59
+
60
+ begin
61
+ # Build dataset config if datasets should be loaded
62
+ dataset_config = {}
63
+ if options[:auto_load_datasets]
64
+ dataset_config = {
65
+ cache_path: options[:cache_dir] ? File.join(options[:cache_dir], 'datasets') : './url_cache/datasets',
66
+ download_path: options[:cache_dir] ? File.join(options[:cache_dir], 'downloads') : './url_cache/downloads'
67
+ }
68
+
69
+ # Add Kaggle credentials if provided
70
+ if options[:kaggle_credentials_file]
71
+ dataset_config[:kaggle] = { credentials_file: options[:kaggle_credentials_file] }
72
+ elsif File.exist?(File.expand_path('~/.kaggle/kaggle.json'))
73
+ dataset_config[:kaggle] = { credentials_file: '~/.kaggle/kaggle.json' }
74
+ end
75
+ end
76
+
77
+ client = UrlCategorise::Client.new(
78
+ cache_dir: options[:cache_dir],
79
+ iab_compliance: options[:iab_compliance],
80
+ smart_categorization: options[:smart_categorization],
81
+ auto_load_datasets: options[:auto_load_datasets],
82
+ dataset_config: dataset_config
83
+ )
84
+
85
+ if options[:verbose] && options[:auto_load_datasets]
86
+ puts "Client initialized with dataset loading enabled"
87
+ puts "Dataset statistics:"
88
+ puts " Dataset categories: #{client.count_of_dataset_categories}"
89
+ puts " Dataset hosts: #{client.count_of_dataset_hosts.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse}"
90
+ puts " Dataset data size: #{client.size_of_dataset_data.round(2)} MB" if client.respond_to?(:size_of_dataset_data)
91
+ end
92
+
93
+ puts "Exporting CSV data..." if options[:verbose]
94
+
95
+ result = client.export_csv_data(options[:output_path])
96
+
97
+ puts "\n✅ Export completed successfully!"
98
+ puts "📁 Export directory: #{result[:export_directory]}"
99
+ puts "📄 CSV file: #{result[:csv_file]}"
100
+ puts "📄 Summary file: #{result[:summary_file]}"
101
+
102
+ puts "\n📊 Data Summary:"
103
+ puts " Total entries: #{result[:total_entries]}"
104
+ puts " Domain categorizations: #{result[:summary][:domain_categorization_entries]}"
105
+ puts " Dataset content entries: #{result[:summary][:dataset_content_entries]}"
106
+ puts " Total categories: #{result[:summary][:total_categories]}"
107
+ puts " Has dataset content: #{result[:summary][:has_dataset_content]}"
108
+
109
+ if options[:verbose]
110
+ puts "\n🏷️ Categories included:"
111
+ result[:summary][:categories].each do |category|
112
+ puts " - #{category}"
113
+ end
114
+ end
115
+
116
+ rescue StandardError => e
117
+ puts "❌ Error: #{e.message}"
118
+ puts e.backtrace if options[:verbose]
119
+ exit 1
120
+ end
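
The verbose branch of the script formats the dataset host count with a reverse-and-`gsub` idiom. A standalone sketch of that digit grouping is shown here, assuming non-negative integers; `with_thousands_separator` is a hypothetical helper name. Note that inside a regex literal the digit class is written `\d` with a single backslash, because `\\d` would match a literal backslash followed by `d`.

```ruby
# Hypothetical helper: inserts a comma after every third digit, counting
# from the right, by reversing the string, grouping, and reversing back.
def with_thousands_separator(number)
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\1,').reverse
end

puts with_thousands_separator(1234567) # prints 1,234,567
puts with_thousands_separator(999)     # prints 999
```

The lookahead `(?=\d)` prevents a trailing comma when the digit count is an exact multiple of three.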