UrlCategorise 0.1.2 → 0.1.6 (RubyGems version diff)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (25)

checksums.yaml +4 -4
data/.claude/settings.local.json +10 -1
data/.gitignore +1 -0
data/CLAUDE.md +88 -3
data/Gemfile +2 -2
data/Gemfile.lock +18 -9
data/README.md +517 -4
data/Rakefile +8 -8
data/bin/check_lists +12 -13
data/bin/console +3 -3
data/bin/export_csv +83 -0
data/bin/export_hosts +68 -0
data/bin/rake +2 -0
data/correct_usage_example.rb +64 -0
data/docs/v0.1.4-features.md +215 -0
data/lib/url_categorise/active_record_client.rb +98 -21
data/lib/url_categorise/client.rb +641 -134
data/lib/url_categorise/constants.rb +86 -71
data/lib/url_categorise/dataset_processor.rb +476 -0
data/lib/url_categorise/iab_compliance.rb +147 -0
data/lib/url_categorise/models.rb +53 -14
data/lib/url_categorise/version.rb +1 -1
data/lib/url_categorise.rb +3 -0
data/url_categorise.gemspec +37 -33
metadata +142 -52

data/README.md CHANGED

@@ -5,6 +5,8 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
 ## Features
 - **Comprehensive Coverage**: 60+ high-quality categories including security, content, and specialized lists
+- **Kaggle Dataset Integration**: Automatic loading and processing of machine learning datasets from Kaggle
+- **Multiple Data Sources**: Supports blocklists, CSV datasets, and Kaggle ML datasets
 - **Multiple List Formats**: Supports hosts files, pfSense, AdSense, uBlock Origin, dnsmasq, and plain text formats
 - **Intelligent Caching**: Hash-based file update detection with configurable local cache
 - **DNS Resolution**: Resolve domains to IPs and check against IP-based blocklists
@@ -14,6 +16,10 @@ A comprehensive Ruby gem for categorizing URLs and domains based on various secu
 - **Metadata Tracking**: Track last update times, ETags, and content hashes
 - **Health Monitoring**: Automatic detection and removal of broken blocklist sources
 - **List Validation**: Built-in tools to verify all configured URLs are accessible
+- **Auto-Loading Datasets**: Automatic processing of predefined datasets during client initialization
+- **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
+- **Data Export**: Export categorized data as hosts files per category or comprehensive CSV exports
+- **CLI Commands**: Command-line utilities for data export and list checking
 ## Installation
@@ -44,6 +50,15 @@ puts "Total hosts: #{client.count_of_hosts}"
 puts "Categories: #{client.count_of_categories}"
 puts "Data size: #{client.size_of_data} MB"
+# Get detailed size breakdown
+puts "Total data size: #{client.size_of_data} MB (#{client.size_of_data_bytes} bytes)"
+puts "Blocklist data size: #{client.size_of_blocklist_data} MB (#{client.size_of_blocklist_data_bytes} bytes)"
+puts "Dataset data size: #{client.size_of_dataset_data} MB (#{client.size_of_dataset_data_bytes} bytes)"
+# Get dataset-specific statistics (if datasets are loaded)
+puts "Dataset hosts: #{client.count_of_dataset_hosts}"
+puts "Dataset categories: #{client.count_of_dataset_categories}"
 # Categorize a URL or domain
 categories = client.categorise("badsite.com")
 puts "Categories: #{categories}" # => [:malware, :phishing]
@@ -57,6 +72,83 @@ ip_categories = client.categorise_ip("192.168.1.100")
 puts "IP categories: #{ip_categories}"
 ```
+## New Features
+### Dynamic Settings with ActiveAttr
+The Client class now supports in-memory modification of settings using ActiveAttr:
+```ruby
+client = UrlCategorise::Client.new
+# Modify settings dynamically
+client.smart_categorization_enabled = true
+client.iab_compliance_enabled = true
+client.iab_version = :v2
+client.request_timeout = 30
+client.dns_servers = ['8.8.8.8', '8.8.4.4']
+# Settings take effect immediately - no need to recreate the client
+categories = client.categorise('reddit.com') # Uses new smart categorization rules
+```
+### Data Export Features
+#### Hosts File Export
+Export all categorized domains as separate hosts files per category:
+```ruby
+# Export to default location
+result = client.export_hosts_files
+# Export to custom location
+result = client.export_hosts_files('/custom/export/path')
+# Result includes file information and summary
+puts "Exported #{result[:_summary][:total_categories]} categories"
+puts "Total domains: #{result[:_summary][:total_domains]}"
+puts "Files saved to: #{result[:_summary][:export_directory]}"
+```
+Each category gets its own hosts file (e.g., `malware.hosts`, `advertising.hosts`) with proper headers and sorted domains.
+#### CSV Data Export
+Export all data as a single CSV file for AI training and analysis:
+```ruby
+# Export to default location
+result = client.export_csv_data
+# Export to custom location with IAB compliance
+client.iab_compliance_enabled = true
+result = client.export_csv_data('/custom/export/path')
+# CSV includes comprehensive data:
+# - domain, category, source_type, is_dataset_category
+# - iab_category_v2, iab_category_v3, export_timestamp
+# - smart_categorization_enabled
+# Metadata file includes:
+# - Export info, client settings, data summary, dataset metadata
+```
+#### CLI Commands
+New command-line utilities for data export:
+```bash
+# Export hosts files
+$ bundle exec export_hosts --output /tmp/hosts --verbose
+# Export CSV data with IAB compliance
+$ bundle exec export_csv --output /tmp/csv --iab-compliance --verbose
+# Check URL health (existing command)
+$ bundle exec check_lists
+```
 ## Advanced Configuration
 ### File Caching
@@ -113,7 +205,11 @@ client = UrlCategorise::Client.new(
   cache_dir: "./url_cache",                                # Enable local caching
   force_download: false,                                   # Use cache when available
   dns_servers: ['1.1.1.1', '1.0.0.1'],                   # Cloudflare DNS servers
-  request_timeout: 15                                      # 15 second HTTP timeout
+  request_timeout: 15,                                     # 15 second HTTP timeout
+  iab_compliance: true,                                    # Enable IAB compliance
+  iab_version: :v3,                                        # Use IAB Content Taxonomy v3.0
+  auto_load_datasets: false,                               # Disable automatic dataset loading (default)
+  smart_categorization: false                              # Disable smart post-processing (default)
 )
 ```
@@ -132,6 +228,165 @@ host_urls = {
 client = UrlCategorise::Client.new(host_urls: host_urls)
 ```
+### Smart Categorization (Post-Processing)
+Smart categorization solves the problem of overly broad domain-level categorization. For example, `reddit.com` might appear in health & fitness blocklists, but not all Reddit content is health-related.
+#### The Problem
+```ruby
+# Without smart categorization
+client.categorise("reddit.com")
+# => [:reddit, :social_media, :health_and_fitness, :forums]  # Too broad!
+client.categorise("reddit.com/r/technology")
+# => [:reddit, :social_media, :health_and_fitness, :forums]  # Still wrong!
+```
+#### The Solution
+```ruby
+# Enable smart categorization
+client = UrlCategorise::Client.new(
+  smart_categorization: true  # Remove overly broad categories
+)
+client.categorise("reddit.com")
+# => [:reddit, :social_media]  # Much more accurate!
+```
+#### How It Works
+Smart categorization automatically removes overly broad categories for known platforms:
+- **Social Media Platforms** (Reddit, Facebook, Twitter, etc.): Removes categories like `:health_and_fitness`, `:forums`, `:news`, `:technology`, `:education`
+- **Search Engines** (Google, Bing, etc.): Removes categories like `:news`, `:shopping`, `:travel`
+- **Video Platforms** (YouTube, Vimeo, etc.): Removes categories like `:education`, `:entertainment`, `:music`
+#### Custom Smart Rules
+You can define custom rules for specific domains or URL patterns:
+```ruby
+custom_rules = {
+  reddit_subreddits: {
+    domains: ['reddit.com'],
+    remove_categories: [:health_and_fitness, :forums],
+    add_categories_by_path: {
+      /\/r\/fitness/ => [:health_and_fitness],      # Add back for /r/fitness
+      /\/r\/technology/ => [:technology],           # Add technology for /r/technology
+      /\/r\/programming/ => [:technology, :programming]
+    }
+  },
+  my_company_domains: {
+    domains: ['mycompany.com'],
+    allowed_categories_only: [:business, :technology]  # Only allow specific categories
+  }
+}
+client = UrlCategorise::Client.new(
+  smart_categorization: true,
+  smart_rules: custom_rules
+)
+# Now path-based categorization works
+client.categorise('reddit.com')           # => [:reddit, :social_media]
+client.categorise('reddit.com/r/fitness') # => [:reddit, :social_media, :health_and_fitness]
+client.categorise('reddit.com/r/technology') # => [:reddit, :social_media, :technology]
+```
+#### Available Rule Types
+- **`remove_categories`**: Remove specific categories for domains
+- **`keep_primary_only`**: Keep only specified categories, remove others
+- **`allowed_categories_only`**: Only allow specific categories, block all others
+- **`add_categories_by_path`**: Add categories based on URL path patterns
+#### Smart Rules with IAB Compliance
+Smart categorization works seamlessly with IAB compliance:
+```ruby
+client = UrlCategorise::Client.new(
+  smart_categorization: true,
+  iab_compliance: true,
+  iab_version: :v3
+)
+# Returns clean IAB codes after smart processing
+categories = client.categorise("reddit.com")  # => ["14"] (Society - Social Media)
+```
+## IAB Content Taxonomy Compliance
+UrlCategorise supports IAB (Interactive Advertising Bureau) Content Taxonomy compliance for standardized content categorization:
+### Basic IAB Compliance
+```ruby
+# Enable IAB v3.0 compliance (default)
+client = UrlCategorise::Client.new(
+  iab_compliance: true,
+  iab_version: :v3
+)
+# Enable IAB v2.0 compliance
+client = UrlCategorise::Client.new(
+  iab_compliance: true,
+  iab_version: :v2
+)
+# Categorization returns IAB codes instead of custom categories
+categories = client.categorise("badsite.com")
+puts categories # => ["626"] (IAB v3 code for illegal content)
+# Check IAB compliance status
+puts client.iab_compliant? # => true
+# Get IAB mapping for a specific category
+puts client.get_iab_mapping(:malware) # => "626" (v3) or "IAB25" (v2)
+```
+### IAB Category Mappings
+The gem maps security and content categories to appropriate IAB codes:
+**IAB Content Taxonomy v3.0 (recommended):**
+- `malware`, `phishing`, `illegal` → `626` (Illegal Content)
+- `advertising`, `mobile_ads` → `3` (Advertising)
+- `gambling` → `7-39` (Gambling)
+- `pornography` → `626` (Adult Content)
+- `social_media` → `14` (Society)
+- `technology` → `19` (Technology & Computing)
+**IAB Content Taxonomy v2.0:**
+- `malware`, `phishing` → `IAB25` (Non-Standard Content)
+- `advertising` → `IAB3` (Advertising)
+- `gambling` → `IAB7-39` (Gambling)
+- `pornography` → `IAB25-3` (Pornography)
+### Integration with Datasets
+IAB compliance works seamlessly with dataset processing:
+```ruby
+client = UrlCategorise::Client.new(
+  iab_compliance: true,
+  iab_version: :v3,
+  dataset_config: {
+    kaggle: { username: 'user', api_key: 'key' }
+  },
+  auto_load_datasets: true  # Automatically load predefined datasets with IAB mapping
+)
+# Load additional datasets - categories will be mapped to IAB codes
+client.load_kaggle_dataset('owner', 'dataset-name')
+client.load_csv_dataset('https://example.com/data.csv')
+# All categorization methods return IAB codes
+categories = client.categorise("example.com") # => ["3", "626"]
+```
 ## Available Categories
 ### Security & Threat Intelligence
@@ -192,6 +447,196 @@ ruby bin/check_lists
 [View all 60+ categories in constants.rb](lib/url_categorise/constants.rb)
+## Dataset Processing
+UrlCategorise supports processing external datasets from Kaggle and CSV files to expand categorization data beyond traditional blocklists. This allows integration of machine learning datasets and custom URL classification data:
+### Automatic Dataset Loading
+Enable automatic loading of predefined datasets during client initialization:
+```ruby
+# Enable automatic dataset loading from constants
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    kaggle: {
+      username: ENV['KAGGLE_USERNAME'],
+      api_key: ENV['KAGGLE_API_KEY']
+    },
+    cache_path: './dataset_cache',
+    download_path: './downloads'
+  },
+  auto_load_datasets: true  # Automatically loads all predefined datasets
+)
+# Datasets are now automatically integrated and ready for use
+categories = client.categorise('https://example.com')
+puts "Dataset categories loaded: #{client.count_of_dataset_categories}"
+puts "Dataset hosts: #{client.count_of_dataset_hosts}"
+```
+The gem includes predefined high-quality datasets in constants:
+- **`shaurov/website-classification-using-url`** - Comprehensive URL classification dataset
+- **`hetulmehta/website-classification`** - Website categorization with cleaned text data
+- **`shawon10/url-classification-dataset-dmoz`** - DMOZ-based URL classification
+- **Data.world CSV dataset** - Additional URL categorization data
+### Manual Dataset Loading
+You can also load datasets manually for more control over the process:
+#### Kaggle Dataset Integration
+Load datasets directly from Kaggle using three authentication methods:
+```ruby
+# Method 1: Environment variables (KAGGLE_USERNAME, KAGGLE_KEY)
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    kaggle: {}  # Will use environment variables
+  }
+)
+# Method 2: Explicit credentials
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    kaggle: {
+      username: 'your_username',
+      api_key: 'your_api_key'
+    }
+  }
+)
+# Method 3: Credentials file (~/.kaggle/kaggle.json or custom path)
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    kaggle: {
+      credentials_file: '/path/to/kaggle.json'
+    }
+  }
+)
+# Load and integrate a Kaggle dataset
+client.load_kaggle_dataset('owner', 'dataset-name', {
+  use_cache: true,  # Cache processed data
+  category_mappings: {
+    url_column: 'website',      # Column containing URLs/domains
+    category_column: 'type',    # Column containing categories
+    category_map: {
+      'malicious' => 'malware', # Map dataset categories to your categories
+      'spam' => 'phishing'
+    }
+  }
+})
+# Check categorization with dataset data
+categories = client.categorise('https://example.com')
+```
+#### CSV Dataset Processing
+Load datasets from direct CSV URLs:
+```ruby
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    download_path: './datasets',
+    cache_path: './dataset_cache'
+  }
+)
+# Load CSV dataset
+client.load_csv_dataset('https://example.com/url-classification.csv', {
+  use_cache: true,
+  category_mappings: {
+    url_column: 'url',
+    category_column: 'category'
+  }
+})
+```
+### Dataset Configuration Options
+```ruby
+dataset_config = {
+  # Kaggle functionality control
+  enable_kaggle: true,              # Set to false to disable Kaggle entirely (default: true)
+  # Kaggle authentication (optional - will try env vars and default file)
+  kaggle: {
+    username: 'kaggle_username',     # Or use KAGGLE_USERNAME env var
+    api_key: 'kaggle_api_key',       # Or use KAGGLE_KEY env var
+    credentials_file: '~/.kaggle/kaggle.json'  # Optional custom path
+  },
+  # File paths
+  download_path: './downloads',      # Where to store downloads
+  cache_path: './cache',            # Where to cache processed data
+  timeout: 30                       # HTTP timeout for downloads
+}
+client = UrlCategorise::Client.new(
+  dataset_config: dataset_config,
+  auto_load_datasets: true          # Enable automatic loading of predefined datasets
+)
+```
+### Disabling Kaggle Functionality
+You can completely disable Kaggle functionality if you only need CSV processing:
+```ruby
+# Disable Kaggle - only CSV datasets will work
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    enable_kaggle: false,
+    download_path: './datasets',
+    cache_path: './dataset_cache'
+  }
+)
+# This will raise an error
+# client.load_kaggle_dataset('owner', 'dataset')  # Error!
+# But CSV datasets still work
+client.load_csv_dataset('https://example.com/data.csv')
+```
+### Working with Cached Datasets
+If you have cached datasets, you can access them even without Kaggle credentials:
+```ruby
+# No credentials provided, but cached data will work
+client = UrlCategorise::Client.new(
+  dataset_config: {
+    kaggle: {},  # Empty config - will show warning but continue
+    download_path: './datasets',
+    cache_path: './cache'
+  }
+)
+# Will work if data is cached, otherwise will show helpful error message
+client.load_kaggle_dataset('owner', 'dataset', use_cache: true)
+```
+### Dataset Metadata and Hashing
+The system automatically tracks dataset metadata and generates content hashes:
+```ruby
+# Get dataset metadata
+metadata = client.dataset_metadata
+metadata.each do |data_hash, meta|
+  puts "Dataset hash: #{data_hash}"
+  puts "Processed at: #{meta[:processed_at]}"
+  puts "Total entries: #{meta[:total_entries]}"
+end
+# Reload client with fresh dataset integration
+client.reload_with_datasets
+```
 ## ActiveRecord Integration
 For high-performance applications, enable database storage:
@@ -215,11 +660,31 @@ categories = client.categorise("example.com")
 # Get database statistics
 stats = client.database_stats
-# => { domains: 50000, ip_addresses: 15000, categories: 45, list_metadata: 90 }
+# => { domains: 50000, ip_addresses: 15000, categories: 45, list_metadata: 90, dataset_metadata: 5 }
 # Direct model access
 domain_record = UrlCategorise::Models::Domain.find_by(domain: "example.com")
 ip_record = UrlCategorise::Models::IpAddress.find_by(ip_address: "1.2.3.4")
+# Dataset integration with ActiveRecord
+client = UrlCategorise::ActiveRecordClient.new(
+  use_database: true,
+  dataset_config: {
+    kaggle: { username: 'user', api_key: 'key' }
+  }
+)
+# Load datasets - automatically stored in database
+client.load_kaggle_dataset('owner', 'dataset')
+client.load_csv_dataset('https://example.com/data.csv')
+# View dataset history
+history = client.dataset_history(limit: 5)
+# => [{ source_type: 'kaggle', identifier: 'owner/dataset', total_entries: 1000, processed_at: ... }]
+# Filter by source type
+kaggle_history = client.dataset_history(source_type: 'kaggle')
+csv_history = client.dataset_history(source_type: 'csv')
 ```
 ## Rails Integration
@@ -274,6 +739,21 @@ class CreateUrlCategoriseTables < ActiveRecord::Migration[7.0]
     add_index :url_categorise_ip_addresses, :ip_address
     add_index :url_categorise_ip_addresses, :categories
+    create_table :url_categorise_dataset_metadata do |t|
+      t.string :source_type, null: false, index: true
+      t.string :identifier, null: false
+      t.string :data_hash, null: false, index: { unique: true }
+      t.integer :total_entries, null: false
+      t.text :category_mappings
+      t.text :processing_options
+      t.datetime :processed_at
+      t.timestamps
+    end
+    add_index :url_categorise_dataset_metadata, :source_type
+    add_index :url_categorise_dataset_metadata, :identifier
+    add_index :url_categorise_dataset_metadata, :processed_at
   end
 end
 ```
@@ -297,7 +777,18 @@ class UrlCategorizerService
       cache_dir: Rails.root.join('tmp', 'url_cache'),
       use_database: true,
       force_download: Rails.env.development?,
-      request_timeout: Rails.env.production? ? 30 : 10  # Longer timeout in production
+      request_timeout: Rails.env.production? ? 30 : 10,  # Longer timeout in production
+      iab_compliance: Rails.env.production?,              # Enable IAB compliance in production
+      iab_version: :v3,                                   # Use IAB Content Taxonomy v3.0
+      auto_load_datasets: Rails.env.production?,          # Auto-load datasets in production
+      dataset_config: {
+        kaggle: {
+          username: ENV['KAGGLE_USERNAME'],
+          api_key: ENV['KAGGLE_API_KEY']
+        },
+        cache_path: Rails.root.join('tmp', 'dataset_cache'),
+        download_path: Rails.root.join('tmp', 'dataset_downloads')
+      }
     )
   end
@@ -320,12 +811,34 @@ class UrlCategorizerService
   end
   def stats
-    @client.database_stats
+    base_stats = @client.database_stats
+    base_stats.merge({
+      dataset_hosts: @client.count_of_dataset_hosts,
+      dataset_categories: @client.count_of_dataset_categories,
+      iab_compliant: @client.iab_compliant?,
+      iab_version: @client.iab_version
+    })
   end
   def refresh_lists!
     @client.update_database
   end
+  def load_dataset(type, identifier, options = {})
+    case type.to_s
+    when 'kaggle'
+      owner, dataset = identifier.split('/')
+      @client.load_kaggle_dataset(owner, dataset, options)
+    when 'csv'
+      @client.load_csv_dataset(identifier, options)
+    else
+      raise ArgumentError, "Unsupported dataset type: #{type}"
+    end
+  end
+  def get_iab_mapping(category)
+    @client.get_iab_mapping(category)
+  end
 end
 ```
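
The README changes above describe three smart-rule keys (`remove_categories`, `allowed_categories_only`, `add_categories_by_path`) but the diff of the README does not show how they are applied. The following is a minimal sketch of one plausible interpretation, not the gem's implementation: `apply_smart_rules` and `EXAMPLE_RULES` are hypothetical names introduced here, and the order in which the rule keys are applied is an assumption.

```ruby
require 'uri'

# Hypothetical helper, NOT part of UrlCategorise. It post-processes an
# already-computed category list using rules shaped like the README example.
def apply_smart_rules(url, categories, rules)
  # Accept bare domains such as "reddit.com/r/fitness" by adding a scheme.
  uri  = URI.parse(url.include?('://') ? url : "https://#{url}")
  host = uri.host.to_s.downcase.sub(/\Awww\./, '')
  path = uri.path.to_s

  rules.each_value do |rule|
    # A rule applies to its listed domains and their subdomains (assumption).
    next unless Array(rule[:domains]).any? { |d| host == d || host.end_with?(".#{d}") }

    # 1. Drop categories considered too broad for this domain.
    categories -= Array(rule[:remove_categories])
    # 2. If an allow-list is given, keep only categories on it.
    categories &= rule[:allowed_categories_only] if rule[:allowed_categories_only]
    # 3. Add categories back when the URL path matches a pattern.
    (rule[:add_categories_by_path] || {}).each do |pattern, extra|
      categories |= extra if path.match?(pattern)
    end
  end

  categories
end

# Rules modelled on the README's reddit_subreddits example.
EXAMPLE_RULES = {
  reddit_subreddits: {
    domains: ['reddit.com'],
    remove_categories: %i[health_and_fitness forums],
    add_categories_by_path: {
      %r{/r/fitness} => [:health_and_fitness],
      %r{/r/technology} => [:technology]
    }
  }
}.freeze

BROAD = %i[reddit social_media health_and_fitness forums].freeze

p apply_smart_rules('reddit.com', BROAD, EXAMPLE_RULES)
# => [:reddit, :social_media]
p apply_smart_rules('reddit.com/r/fitness', BROAD, EXAMPLE_RULES)
# => [:reddit, :social_media, :health_and_fitness]
```

The sketch returns a new array rather than mutating its input, so the same blocklist-derived category list can be reused across lookups.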

data/Rakefile CHANGED

@@ -1,12 +1,12 @@
-require "bundler/gem_tasks"
-require "bundler/setup"
-require "rake/testtask"
+require 'bundler/gem_tasks'
+require 'bundler/setup'
+require 'rake/testtask'
 Rake::TestTask.new(:test) do |t|
-  t.libs << "test"
-  t.libs << "lib"
-  t.test_files = FileList["test/**/*_test.rb"]
-  t.ruby_opts = ["-rbundler/setup"]
+  t.libs << 'test'
+  t.libs << 'lib'
+  t.test_files = FileList['test/**/*_test.rb']
+  t.ruby_opts = ['-rbundler/setup']
 end
-task :default => :test
+task default: :test

data/bin/check_lists CHANGED

@@ -3,46 +3,45 @@
 require 'bundler/setup'
 require_relative '../lib/url_categorise'
-puts "=== CHECKING ALL URLs IN CONSTANTS ==="
+puts '=== CHECKING ALL URLs IN CONSTANTS ==='
 UrlCategorise::Constants::DEFAULT_HOST_URLS.each do |category, urls|
   puts "\n#{category.upcase}:"
   # Skip categories that only reference other categories (symbols)
   actual_urls = urls.reject { |url| url.is_a?(Symbol) }
   if actual_urls.empty?
     if urls.empty?
-      puts "  Empty category (no URLs defined)"
+      puts '  Empty category (no URLs defined)'
     else
       puts "  Only references other categories: #{urls}"
     end
     next
   end
   actual_urls.each do |url|
     print "  Testing #{url}... "
     begin
       response = HTTParty.head(url, timeout: 10)
       case response.code
       when 200
-        puts "✅ OK"
+        puts '✅ OK'
       when 404
-        puts "❌ 404 Not Found"
+        puts '❌ 404 Not Found'
       when 403
-        puts "❌ 403 Forbidden"
+        puts '❌ 403 Forbidden'
       when 500..599
         puts "❌ Server Error (#{response.code})"
       else
         puts "⚠️ HTTP #{response.code}"
       end
     rescue Net::TimeoutError, HTTParty::TimeoutError
-      puts "❌ Timeout"
-    rescue SocketError, Errno::ECONNREFUSED => e
-      puts "❌ DNS/Network Error"
-    rescue => e
+      puts '❌ Timeout'
+    rescue SocketError, Errno::ECONNREFUSED
+      puts '❌ DNS/Network Error'
+    rescue StandardError => e
       puts "❌ Error: #{e.class}"
     end
   end
 end
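
The status handling in `bin/check_lists` above is interleaved with the HTTP call, which makes it awkward to exercise without network access. As a rough sketch, the same `case` mapping could be isolated in a pure function; `describe_status` is a hypothetical name and is not defined anywhere in the gem.

```ruby
# Hypothetical helper mirroring the case statement in bin/check_lists.
# It maps an integer HTTP status code to the message the script prints.
def describe_status(code)
  case code
  when 200 then '✅ OK'
  when 404 then '❌ 404 Not Found'
  when 403 then '❌ 403 Forbidden'
  when 500..599 then "❌ Server Error (#{code})"
  else "⚠️ HTTP #{code}"
  end
end

puts describe_status(200) # prints "✅ OK"
puts describe_status(503) # prints "❌ Server Error (503)"
puts describe_status(301) # prints "⚠️ HTTP 301"
```

Keeping the mapping separate from the request would let the script's output be checked with plain integers.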

data/bin/console CHANGED

@@ -1,11 +1,11 @@
 #!/usr/bin/env ruby
-require "bundler/setup"
-require "url_categorise"
+require 'bundler/setup'
+require 'url_categorise'
 # You can add fixtures and/or initialization code here to make experimenting
 # with your gem easier. You can also use a different console, if you like.
 # (If you use this, don't forget to add pry to your Gemfile!)
-require "pry"
+require 'pry'
 Pry.start