UrlCategorise 0.1.3 → 0.1.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: dcb05d79b6bc09b5b338183d412cd309d9634a342c95b14ea25df5926d8609fb
- data.tar.gz: effa4c7a010ee574fe6a41653af553a68710ceca46ebdb9dd5352096af5fa7e1
+ metadata.gz: e4725733a26240e3bc23bfe9292d3ed0222db4e0308c5228da12a9ede1347e4b
+ data.tar.gz: a315a03357b48260c543459fab91ac8ec0601a0d7001b120bdc6f72e543e3671
  SHA512:
- metadata.gz: a527c801cbf6318305d640925dd75922c2ac1bcc76a5de75c75e3ad24698305c3d7d885d7da8dd280e61ceb1fe91a57eac185c5c209e11685f2ddb6833b120b9
- data.tar.gz: de81765d20a0b36c54b71b935928777140f86f6e0a130c71ec7804ed28c1b3d8f12ca30fff5cf8c93952aaed7a9c279fe35a0d32bc83a851aad2556b55fd7942
+ metadata.gz: 74c8be653a2ae97ad74300105a95c326685d9669defd5c980cf16636e96965080a587453c21d5af6014874145f8f93d66c444574142f1568588f82b9577a7440
+ data.tar.gz: 450ae86237aa6a743a289e9b1165b9bae49f78068af8d727114763629b863567dc37de1b2ba38522e93a0ae08dd1683b79a062df2e2d18e76884fd4f8415c1db
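The digests in `checksums.yaml` can be checked locally. The following is a minimal sketch, assuming the `.gem` archive has already been unpacked into a directory and `checksums.yaml.gz` has been decompressed next to `metadata.gz` and `data.tar.gz`. The helper name `verify_checksums` is hypothetical and is not part of RubyGems or of this gem.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper (not a RubyGems API): compares the files of an unpacked
# .gem directory against the digests recorded in its checksums.yaml.
# Returns one [algorithm, filename, matches?] triple per recorded digest.
def verify_checksums(dir)
  checksums = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }

  checksums.flat_map do |algorithm, files|
    digest_class = algorithms.fetch(algorithm)
    files.map do |name, expected|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      [algorithm, name, actual == expected]
    end
  end
end
```

Calling `verify_checksums('/path/to/unpacked_gem')` (an illustrative path) yields four triples for a file like the one above, one per algorithm and archive member.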
@@ -7,10 +7,16 @@
  "Bash(bundle exec ruby:*)",
  "Bash(find:*)",
  "Bash(grep:*)",
- "Read(//Users/trex22/development/rubygems/kaggle/**)",
  "Bash(for file in test/url_categorise/*dataset*test.rb)",
  "Bash(do echo \"Checking $file...\")",
- "Bash(done)"
+ "Bash(done)",
+ "Bash(bundle exec rubocop:*)",
+ "Bash(rubocop:*)",
+ "Bash(timeout 30 ruby dataset_loading_example.rb)",
+ "Bash(timeout:*)",
+ "Bash(DEBUG=1 timeout 300 ruby correct_usage_example.rb)",
+ "Bash(chmod:*)",
+ "Bash(bundle exec bin/export_csv:*)"
  ],
  "deny": []
  }
data/.gitignore CHANGED
@@ -54,3 +54,5 @@ build-iPhoneSimulator/
 
  # Used by RuboCop. Remote config files pulled in from inherit_from directive.
  # .rubocop-https?--*
+ url_cache/*
+ exports/*
data/CLAUDE.md CHANGED
@@ -84,14 +84,146 @@ The gem includes automatic monitoring and cleanup of broken URLs:
  - **Data Hashing**: SHA256 content hashing for dataset change detection
  - **Category Mapping**: Flexible column detection and category mapping for datasets
  - **Credential Warnings**: Helpful warnings when Kaggle credentials are missing but functionality continues
+ - **IAB Compliance**: Full support for IAB Content Taxonomy v2.0 and v3.0 standards
+ - **Dataset-Specific Metrics**: Separate counting methods for dataset vs DNS list categorization
+ - **Enhanced Statistics**: Extended helper methods for comprehensive data insights
+ - **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
+ - **Data Export**: Multiple export formats including hosts files per category and CSV data exports
+ - **CLI Commands**: Command-line utilities for data export and list checking
 
  ### Architecture
- - `Client` class: Main interface for categorization
+ - `Client` class: Main interface for categorization with IAB compliance support and ActiveAttr attributes
  - `DatasetProcessor` class: Handles Kaggle and CSV dataset processing
+ - `IabCompliance` module: Maps categories to IAB Content Taxonomy v2.0/v3.0 standards
  - `Constants` module: Contains default list URLs and categories
  - `ActiveRecordClient` class: Database-backed client with dataset history
  - Modular design allows extending with new list sources and datasets
- - Support for custom list directories, caching, and dataset integration
+ - Support for custom list directories, caching, dataset integration, IAB compliance, and data export
+ - ActiveAttr integration for dynamic setting modification and attribute validation
+
+ ### New Features (Latest Version)
+
+ #### Video Hosting Detection and Regex Categorization
+ Advanced video content detection system with:
+
+ - **Comprehensive Video Hosting Lists**: Generate PiHole-compatible hosts files from yt-dlp extractors
+ - **Regex-Based Content Detection**: Distinguish between video content URLs vs homepages/profiles/playlists
+ - **Direct Video URL Detection**: `video_url?` method to check if URLs are direct video content links
+ - **Automatic List Generation**: `bin/generate_video_lists` script fetches and processes yt-dlp data
+ - **Video Hosting Category**: Separate `video_hosting` category with 3,500+ domains
+ - **Smart Content Categorization**: URLs matching video patterns get `*_content` suffix categories
+ - **Remote Pattern Files**: Automatically downloads video patterns from GitHub repository
+
+ ```ruby
+ # Enable regex categorization for video content detection (uses remote patterns by default)
+ client = UrlCategorise::Client.new(regex_categorization: true)
+
+ # Basic domain categorization
+ client.categorise('https://youtube.com') # => [:video_hosting]
+
+ # Enhanced content detection
+ client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
+
+ # Direct video URL detection
+ client.video_url?('https://youtube.com/watch?v=abc123') # => true
+ client.video_url?('https://youtube.com') # => false
+ client.video_url?('https://vimeo.com/123456789') # => true
+ client.video_url?('https://tiktok.com/@user/video/123') # => true
+ ```
+
+ #### Dynamic Settings with ActiveAttr
+ The Client class now uses ActiveAttr to provide dynamic attribute modification:
+
+ ```ruby
+ client = UrlCategorise::Client.new
+
+ # Modify settings in-memory
+ client.smart_categorization_enabled = true
+ client.iab_compliance_enabled = true
+ client.iab_version = :v2
+ client.request_timeout = 30
+ client.dns_servers = ['8.8.8.8', '8.8.4.4']
+
+ # Settings take effect immediately
+ categories = client.categorise('reddit.com') # Uses new smart categorization rules
+ ```
+
+ #### Data Export Features
+
+ ##### Hosts File Export
+ Export all categorized domains as separate hosts files per category:
+
+ ```ruby
+ # Export to default location (cache_dir/exports/hosts or ./exports/hosts)
+ result = client.export_hosts_files
+
+ # Export to custom location
+ result = client.export_hosts_files('/custom/export/path')
+
+ # Returns hash with file information:
+ # {
+ # malware: { path: '/path/malware.hosts', filename: 'malware.hosts', count: 1500 },
+ # advertising: { path: '/path/advertising.hosts', filename: 'advertising.hosts', count: 25000 },
+ # _summary: { total_categories: 15, total_domains: 50000, export_directory: '/path' }
+ # }
+ ```
+
+ ##### CSV Data Export
+ Export all data as a single comprehensive CSV file for AI training and analysis:
+
+ ```ruby
+ # Export to default location (cache_dir/exports/csv or ./exports/csv)
+ result = client.export_csv_data
+
+ # Export to custom location
+ result = client.export_csv_data('/custom/export/path')
+
+ # Returns:
+ # {
+ # csv_file: 'url_categorise_comprehensive_export_TIMESTAMP.csv',
+ # summary_file: 'export_summary_TIMESTAMP.json',
+ # total_entries: 75000,
+ # summary: { domain_categorization_entries: 50000, dataset_content_entries: 25000 },
+ # export_directory: '/export/path'
+ # }
+ ```
+
+ **Comprehensive Export Features:**
+ - **Everything in One File**: Combined domains, categories, and raw dataset content
+ - **Rich Dataset Content**: Original titles, descriptions, summaries, and text from datasets
+ - **Dynamic Headers**: Automatically includes all available fields from any dataset
+ - **Data Type Tracking**: Distinguishes between processed domains and raw dataset entries
+ - **Perfect for AI/ML**: Single file with both structured categorization and rich textual features
+
+ #### CLI Commands
+ Command-line utilities for comprehensive data export:
+
+ ```bash
+ # Export hosts files per category
+ $ bundle exec export_hosts --output /tmp/hosts --verbose
+
+ # Full CSV export with datasets and all features
+ $ bundle exec export_csv --auto-load-datasets --iab-compliance --smart-categorization --verbose
+
+ # Custom configuration export
+ $ bundle exec export_csv --cache-dir ./custom_cache --kaggle-credentials ~/kaggle.json --output /tmp/export
+
+ # Basic domain categorization only
+ $ bundle exec export_csv --output /tmp/basic
+
+ # Health check for all blocklist URLs
+ $ bundle exec check_lists
+
+ # Generate updated video hosting lists
+ $ ruby bin/generate_video_lists
+ ```
+
+ **Enhanced CLI Features:**
+ - `--auto-load-datasets`: Automatically load datasets from constants for rich content export
+ - `--kaggle-credentials FILE`: Custom Kaggle API credentials file path
+ - Full integration with all client features (IAB compliance, smart categorization, etc.)
+ - Verbose output shows dataset statistics and loading progress
+ - Video list generation from yt-dlp extractors with manual curation
 
  ### List Sources
  Primary sources include:
@@ -99,9 +231,15 @@ Primary sources include:
  - hagezi/dns-blocklists
  - StevenBlack/hosts
  - Various specialized security lists
+ - **yt-dlp video extractors**: Comprehensive video hosting domain detection (3,500+ domains)
+ - **GitHub-hosted video patterns**: Remote video URL detection patterns with manual curation
  - **Kaggle datasets**: Public URL classification datasets
  - **Custom CSV files**: Direct CSV dataset URLs with flexible column mapping
 
+ **Video hosting lists are now automatically fetched from:**
+ - Video domains: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
+ - URL patterns: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
+
  ### Testing Guidelines
  - Mock all HTTP requests using WebMock
  - Test both success and failure scenarios
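The CLAUDE.md changes describe regex-based detection that separates direct video URLs from homepages and profiles. The sketch below illustrates that general idea only. The three patterns and the method name `video_url_sketch?` are assumptions made for illustration; the gem loads its real patterns from the remote `video_url_patterns.txt` file, and its actual `video_url?` implementation is not shown in this diff.

```ruby
require 'uri'

# Hypothetical patterns for illustration only; they are not the gem's list.
# Each pattern is matched against "host + path + query" of the URL.
VIDEO_PATTERNS = [
  %r{\A(?:www\.)?youtube\.com/watch\?v=[\w-]+},
  %r{\A(?:www\.)?vimeo\.com/\d+},
  %r{\A(?:www\.)?tiktok\.com/@[\w.-]+/video/\d+}
].freeze

# Returns true when the URL looks like a direct video link under the
# patterns above, false for homepages, non-HTTP(S) URLs, and invalid input.
def video_url_sketch?(url, patterns = VIDEO_PATTERNS)
  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) && uri.host

  target = "#{uri.host}#{uri.request_uri}"
  patterns.any? { |pattern| pattern.match?(target) }
rescue URI::InvalidURIError
  false
end

video_url_sketch?('https://youtube.com/watch?v=abc123') # => true
video_url_sketch?('https://youtube.com')                # => false
```

Matching against the host plus the request URI keeps the scheme out of the patterns, so one pattern covers both `http` and `https` links.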
data/Gemfile.lock CHANGED
@@ -1,14 +1,17 @@
  PATH
  remote: .
  specs:
- UrlCategorise (0.1.3)
+ UrlCategorise (0.1.7)
+ active_attr (>= 0.17.1, < 1.0)
  api_pattern (>= 0.0.6, < 1.0)
  csv (>= 3.3.0, < 4.0)
  digest (>= 3.1.0, < 4.0)
  fileutils (>= 1.7.0, < 2.0)
  httparty (>= 0.22.0, < 1.0)
  json (>= 2.7.0, < 3.0)
+ kaggle (>= 0.0.3, < 1.0)
  nokogiri (>= 1.18.9, < 2.0)
+ reline (>= 0.6.2)
  resolv (>= 0.4.0, < 1.0)
  rubyzip (>= 2.3.0, < 3.0)
 
@@ -86,7 +89,14 @@ GEM
  multi_xml (>= 0.5.2)
  i18n (1.14.7)
  concurrent-ruby (~> 1.0)
+ io-console (0.8.1)
  json (2.13.2)
+ kaggle (0.0.3)
+ csv (>= 3.3)
+ fileutils (>= 1.7)
+ httparty (>= 0.23)
+ oj (= 3.16.11)
+ rubyzip (>= 2.0)
  logger (1.7.0)
  loofah (2.24.1)
  crass (~> 1.0.2)
@@ -107,6 +117,10 @@ GEM
  bigdecimal (~> 3.1)
  nokogiri (1.18.9-arm64-darwin)
  racc (~> 1.4)
+ oj (3.16.11)
+ bigdecimal (>= 3.0)
+ ostruct (>= 0.2)
+ ostruct (0.6.3)
  pry (0.15.2)
  coderay (~> 1.1)
  method_source (~> 1.0)
@@ -125,6 +139,8 @@ GEM
  loofah (~> 2.21)
  nokogiri (>= 1.15.7, != 1.16.7, != 1.16.6, != 1.16.5, != 1.16.4, != 1.16.3, != 1.16.2, != 1.16.1, != 1.16.0.rc1, != 1.16.0)
  rake (13.3.0)
+ reline (0.6.2)
+ io-console (~> 0.5)
  resolv (0.6.2)
  rexml (3.4.1)
  ruby-progressbar (1.13.0)
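The version constraints added in `Gemfile.lock` can be checked against the resolved versions with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. This is a small sketch; the wrapper name `satisfies?` is a hypothetical convenience, and the version/constraint pairs are taken from the lockfile hunks above.

```ruby
require 'rubygems'

# Hypothetical convenience wrapper around Gem::Requirement#satisfied_by?.
def satisfies?(version, *constraints)
  Gem::Requirement.new(*constraints).satisfied_by?(Gem::Version.new(version))
end

satisfies?('0.0.3', '>= 0.0.3', '< 1.0') # kaggle     => true
satisfies?('0.8.1', '~> 0.5')            # io-console => true
satisfies?('3.16.11', '= 3.16.11')       # oj         => true
satisfies?('1.0.0', '>= 0.0.3', '< 1.0') # outside the kaggle range => false
```

The pessimistic operator `~> 0.5` allows any `0.x` release from `0.5` upward, which is why `io-console 0.8.1` satisfies reline's requirement, while kaggle's `oj (= 3.16.11)` pins a single exact version.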