UrlCategorise 0.1.3 → 0.1.7
- checksums.yaml +4 -4
- data/.claude/settings.local.json +8 -2
- data/.gitignore +2 -0
- data/CLAUDE.md +140 -2
- data/Gemfile.lock +17 -1
- data/README.md +450 -7
- data/bin/export_csv +120 -0
- data/bin/export_hosts +68 -0
- data/bin/generate_video_lists +373 -0
- data/bin/rake +2 -0
- data/correct_usage_example.rb +64 -0
- data/docs/v0.1.4-features.md +215 -0
- data/docs/video-url-detection.md +353 -0
- data/lib/url_categorise/active_record_client.rb +1 -1
- data/lib/url_categorise/client.rb +699 -39
- data/lib/url_categorise/constants.rb +9 -6
- data/lib/url_categorise/dataset_processor.rb +27 -10
- data/lib/url_categorise/iab_compliance.rb +149 -0
- data/lib/url_categorise/version.rb +1 -1
- data/lib/url_categorise.rb +2 -0
- data/lists/video_hosting_domains.hosts +7057 -0
- data/lists/video_url_patterns.txt +297 -0
- data/url_categorise.gemspec +5 -2
- metadata +70 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e4725733a26240e3bc23bfe9292d3ed0222db4e0308c5228da12a9ede1347e4b
+  data.tar.gz: a315a03357b48260c543459fab91ac8ec0601a0d7001b120bdc6f72e543e3671
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 74c8be653a2ae97ad74300105a95c326685d9669defd5c980cf16636e96965080a587453c21d5af6014874145f8f93d66c444574142f1568588f82b9577a7440
+  data.tar.gz: 450ae86237aa6a743a289e9b1165b9bae49f78068af8d727114763629b863567dc37de1b2ba38522e93a0ae08dd1683b79a062df2e2d18e76884fd4f8415c1db
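To check a downloaded gem against the digests listed above, the comparison can be sketched in Ruby as follows. This is a minimal sketch, not part of the package: the helper name `digest_matches?` and the sample payload are hypothetical, and a real check would read the bytes of the `data.tar.gz` entry inside the `.gem` archive.

```ruby
require 'digest'

# Hypothetical helper: true when the SHA256 hex digest of +bytes+
# equals +expected_hex+ (compared case-insensitively).
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Illustrative payload only; a real check would read data.tar.gz from the gem.
payload  = 'example payload'
expected = Digest::SHA256.hexdigest(payload)

puts digest_matches?(payload, expected)        # unmodified payload
puts digest_matches?(payload + '!', expected)  # tampered payload
```

The same shape works for the SHA512 entries by swapping in `Digest::SHA512`.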
data/.claude/settings.local.json
CHANGED
@@ -7,10 +7,16 @@
       "Bash(bundle exec ruby:*)",
       "Bash(find:*)",
       "Bash(grep:*)",
-      "Read(//Users/trex22/development/rubygems/kaggle/**)",
       "Bash(for file in test/url_categorise/*dataset*test.rb)",
       "Bash(do echo \"Checking $file...\")",
-      "Bash(done)"
+      "Bash(done)",
+      "Bash(bundle exec rubocop:*)",
+      "Bash(rubocop:*)",
+      "Bash(timeout 30 ruby dataset_loading_example.rb)",
+      "Bash(timeout:*)",
+      "Bash(DEBUG=1 timeout 300 ruby correct_usage_example.rb)",
+      "Bash(chmod:*)",
+      "Bash(bundle exec bin/export_csv:*)"
     ],
     "deny": []
   }
data/.gitignore
CHANGED
data/CLAUDE.md
CHANGED
@@ -84,14 +84,146 @@ The gem includes automatic monitoring and cleanup of broken URLs:
 - **Data Hashing**: SHA256 content hashing for dataset change detection
 - **Category Mapping**: Flexible column detection and category mapping for datasets
 - **Credential Warnings**: Helpful warnings when Kaggle credentials are missing but functionality continues
+- **IAB Compliance**: Full support for IAB Content Taxonomy v2.0 and v3.0 standards
+- **Dataset-Specific Metrics**: Separate counting methods for dataset vs DNS list categorization
+- **Enhanced Statistics**: Extended helper methods for comprehensive data insights
+- **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
+- **Data Export**: Multiple export formats including hosts files per category and CSV data exports
+- **CLI Commands**: Command-line utilities for data export and list checking
 
 ### Architecture
-- `Client` class: Main interface for categorization
+- `Client` class: Main interface for categorization with IAB compliance support and ActiveAttr attributes
 - `DatasetProcessor` class: Handles Kaggle and CSV dataset processing
+- `IabCompliance` module: Maps categories to IAB Content Taxonomy v2.0/v3.0 standards
 - `Constants` module: Contains default list URLs and categories
 - `ActiveRecordClient` class: Database-backed client with dataset history
 - Modular design allows extending with new list sources and datasets
-- Support for custom list directories, caching, and
+- Support for custom list directories, caching, dataset integration, IAB compliance, and data export
+- ActiveAttr integration for dynamic setting modification and attribute validation
+
+### New Features (Latest Version)
+
+#### Video Hosting Detection and Regex Categorization
+Advanced video content detection system with:
+
+- **Comprehensive Video Hosting Lists**: Generate PiHole-compatible hosts files from yt-dlp extractors
+- **Regex-Based Content Detection**: Distinguish between video content URLs vs homepages/profiles/playlists
+- **Direct Video URL Detection**: `video_url?` method to check if URLs are direct video content links
+- **Automatic List Generation**: `bin/generate_video_lists` script fetches and processes yt-dlp data
+- **Video Hosting Category**: Separate `video_hosting` category with 3,500+ domains
+- **Smart Content Categorization**: URLs matching video patterns get `*_content` suffix categories
+- **Remote Pattern Files**: Automatically downloads video patterns from GitHub repository
+
+```ruby
+# Enable regex categorization for video content detection (uses remote patterns by default)
+client = UrlCategorise::Client.new(regex_categorization: true)
+
+# Basic domain categorization
+client.categorise('https://youtube.com') # => [:video_hosting]
+
+# Enhanced content detection
+client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
+
+# Direct video URL detection
+client.video_url?('https://youtube.com/watch?v=abc123') # => true
+client.video_url?('https://youtube.com') # => false
+client.video_url?('https://vimeo.com/123456789') # => true
+client.video_url?('https://tiktok.com/@user/video/123') # => true
+```
+
+#### Dynamic Settings with ActiveAttr
+The Client class now uses ActiveAttr to provide dynamic attribute modification:
+
+```ruby
+client = UrlCategorise::Client.new
+
+# Modify settings in-memory
+client.smart_categorization_enabled = true
+client.iab_compliance_enabled = true
+client.iab_version = :v2
+client.request_timeout = 30
+client.dns_servers = ['8.8.8.8', '8.8.4.4']
+
+# Settings take effect immediately
+categories = client.categorise('reddit.com') # Uses new smart categorization rules
+```
+
+#### Data Export Features
+
+##### Hosts File Export
+Export all categorized domains as separate hosts files per category:
+
+```ruby
+# Export to default location (cache_dir/exports/hosts or ./exports/hosts)
+result = client.export_hosts_files
+
+# Export to custom location
+result = client.export_hosts_files('/custom/export/path')
+
+# Returns hash with file information:
+# {
+# malware: { path: '/path/malware.hosts', filename: 'malware.hosts', count: 1500 },
+# advertising: { path: '/path/advertising.hosts', filename: 'advertising.hosts', count: 25000 },
+# _summary: { total_categories: 15, total_domains: 50000, export_directory: '/path' }
+# }
+```
+
+##### CSV Data Export
+Export all data as a single comprehensive CSV file for AI training and analysis:
+
+```ruby
+# Export to default location (cache_dir/exports/csv or ./exports/csv)
+result = client.export_csv_data
+
+# Export to custom location
+result = client.export_csv_data('/custom/export/path')
+
+# Returns:
+# {
+# csv_file: 'url_categorise_comprehensive_export_TIMESTAMP.csv',
+# summary_file: 'export_summary_TIMESTAMP.json',
+# total_entries: 75000,
+# summary: { domain_categorization_entries: 50000, dataset_content_entries: 25000 },
+# export_directory: '/export/path'
+# }
+```
+
+**Comprehensive Export Features:**
+- **Everything in One File**: Combined domains, categories, and raw dataset content
+- **Rich Dataset Content**: Original titles, descriptions, summaries, and text from datasets
+- **Dynamic Headers**: Automatically includes all available fields from any dataset
+- **Data Type Tracking**: Distinguishes between processed domains and raw dataset entries
+- **Perfect for AI/ML**: Single file with both structured categorization and rich textual features
+
+#### CLI Commands
+Command-line utilities for comprehensive data export:
+
+```bash
+# Export hosts files per category
+$ bundle exec export_hosts --output /tmp/hosts --verbose
+
+# Full CSV export with datasets and all features
+$ bundle exec export_csv --auto-load-datasets --iab-compliance --smart-categorization --verbose
+
+# Custom configuration export
+$ bundle exec export_csv --cache-dir ./custom_cache --kaggle-credentials ~/kaggle.json --output /tmp/export
+
+# Basic domain categorization only
+$ bundle exec export_csv --output /tmp/basic
+
+# Health check for all blocklist URLs
+$ bundle exec check_lists
+
+# Generate updated video hosting lists
+$ ruby bin/generate_video_lists
+```
+
+**Enhanced CLI Features:**
+- `--auto-load-datasets`: Automatically load datasets from constants for rich content export
+- `--kaggle-credentials FILE`: Custom Kaggle API credentials file path
+- Full integration with all client features (IAB compliance, smart categorization, etc.)
+- Verbose output shows dataset statistics and loading progress
+- Video list generation from yt-dlp extractors with manual curation
 
 ### List Sources
 Primary sources include:
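The hosts-file export documented in the hunk above returns one entry per category plus a `_summary` entry. The sketch below shows one way that return shape could be produced. It is not the gem's implementation: `write_hosts_files` is a hypothetical stand-in for `export_hosts_files`, and the `0.0.0.0 domain` line format is an assumption based on the PiHole-compatible wording in the documentation.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for the export described above; not the gem's code.
# +categories+ maps a category symbol to an array of domain strings.
def write_hosts_files(categories, export_dir)
  FileUtils.mkdir_p(export_dir)
  result = {}

  categories.each do |category, domains|
    unique   = domains.uniq.sort
    filename = "#{category}.hosts"
    path     = File.join(export_dir, filename)
    # Assumed line format: "0.0.0.0 domain", one entry per line.
    File.write(path, unique.map { |d| "0.0.0.0 #{d}" }.join("\n") + "\n")
    result[category] = { path: path, filename: filename, count: unique.size }
  end

  # The summary is computed before it is stored, so it counts categories only.
  result[:_summary] = {
    total_categories: categories.size,
    total_domains: result.values.sum { |info| info[:count] },
    export_directory: export_dir
  }
  result
end

# Usage with made-up domains, written to a temporary directory.
Dir.mktmpdir do |dir|
  sample = { malware: %w[bad.example evil.example], advertising: %w[ads.example] }
  puts write_hosts_files(sample, dir)[:_summary][:total_domains]
end
```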
@@ -99,9 +231,15 @@ Primary sources include:
 - hagezi/dns-blocklists
 - StevenBlack/hosts
 - Various specialized security lists
+- **yt-dlp video extractors**: Comprehensive video hosting domain detection (3,500+ domains)
+- **GitHub-hosted video patterns**: Remote video URL detection patterns with manual curation
 - **Kaggle datasets**: Public URL classification datasets
 - **Custom CSV files**: Direct CSV dataset URLs with flexible column mapping
 
+**Video hosting lists are now automatically fetched from:**
+- Video domains: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
+- URL patterns: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
+
 ### Testing Guidelines
 - Mock all HTTP requests using WebMock
 - Test both success and failure scenarios
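The CLAUDE.md changes above describe regex-based detection that separates direct video links from homepages and profiles. A minimal sketch of that idea follows. The three patterns and the method name `video_content_url?` are hypothetical, chosen only to reproduce the documented `video_url?` examples. The gem loads its own patterns from `lists/video_url_patterns.txt`, whose contents are not shown in this part of the diff.

```ruby
require 'uri'

# Hypothetical per-host path patterns, for illustration only.
VIDEO_PATTERNS = {
  'youtube.com' => %r{\A/watch\z},
  'vimeo.com'   => %r{\A/\d+\z},
  'tiktok.com'  => %r{\A/@[^/]+/video/\d+\z}
}.freeze

# Returns true when +url+ looks like a direct video link rather than a
# homepage, profile, or other non-video page on a known video host.
def video_content_url?(url)
  uri  = URI.parse(url)
  host = uri.host.to_s.downcase.sub(/\Awww\./, '')
  pattern = VIDEO_PATTERNS[host]
  return false unless pattern && pattern.match?(uri.path)

  # A YouTube watch page additionally needs a non-empty v= query parameter.
  host != 'youtube.com' || uri.query.to_s.match?(/(\A|&)v=[^&]+/)
rescue URI::InvalidURIError
  false
end

%w[
  https://youtube.com/watch?v=abc123
  https://youtube.com
  https://vimeo.com/123456789
  https://tiktok.com/@user/video/123
].each { |url| puts "#{url} -> #{video_content_url?(url)}" }
```

Matching on the parsed path rather than the raw string keeps query strings and `www.` prefixes from affecting the host lookup.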
data/Gemfile.lock
CHANGED
@@ -1,14 +1,17 @@
 PATH
   remote: .
   specs:
-    UrlCategorise (0.1.
+    UrlCategorise (0.1.7)
+      active_attr (>= 0.17.1, < 1.0)
       api_pattern (>= 0.0.6, < 1.0)
       csv (>= 3.3.0, < 4.0)
       digest (>= 3.1.0, < 4.0)
       fileutils (>= 1.7.0, < 2.0)
       httparty (>= 0.22.0, < 1.0)
       json (>= 2.7.0, < 3.0)
+      kaggle (>= 0.0.3, < 1.0)
       nokogiri (>= 1.18.9, < 2.0)
+      reline (>= 0.6.2)
       resolv (>= 0.4.0, < 1.0)
       rubyzip (>= 2.3.0, < 3.0)
 
@@ -86,7 +89,14 @@ GEM
       multi_xml (>= 0.5.2)
     i18n (1.14.7)
       concurrent-ruby (~> 1.0)
+    io-console (0.8.1)
     json (2.13.2)
+    kaggle (0.0.3)
+      csv (>= 3.3)
+      fileutils (>= 1.7)
+      httparty (>= 0.23)
+      oj (= 3.16.11)
+      rubyzip (>= 2.0)
     logger (1.7.0)
     loofah (2.24.1)
       crass (~> 1.0.2)
@@ -107,6 +117,10 @@ GEM
       bigdecimal (~> 3.1)
     nokogiri (1.18.9-arm64-darwin)
       racc (~> 1.4)
+    oj (3.16.11)
+      bigdecimal (>= 3.0)
+      ostruct (>= 0.2)
+    ostruct (0.6.3)
     pry (0.15.2)
       coderay (~> 1.1)
       method_source (~> 1.0)
@@ -125,6 +139,8 @@ GEM
       loofah (~> 2.21)
       nokogiri (>= 1.15.7, != 1.16.7, != 1.16.6, != 1.16.5, != 1.16.4, != 1.16.3, != 1.16.2, != 1.16.1, != 1.16.0.rc1, != 1.16.0)
     rake (13.3.0)
+    reline (0.6.2)
+      io-console (~> 0.5)
     resolv (0.6.2)
     rexml (3.4.1)
     ruby-progressbar (1.13.0)
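The new lockfile entries pair a lower bound with an upper bound, for example `kaggle (>= 0.0.3, < 1.0)`. As a small illustration of how such a constraint is evaluated, the sketch below uses RubyGems' own `Gem::Requirement`. The candidate version numbers are made up for the example, apart from 0.0.3, which is the version resolved in the GEM section.

```ruby
require 'rubygems'

# Constraint copied from the PATH section of the lockfile.
kaggle_requirement = Gem::Requirement.new('>= 0.0.3', '< 1.0')

# Candidate versions are illustrative; only 0.0.3 appears in the lockfile.
%w[0.0.2 0.0.3 0.9.9 1.0.0].each do |version|
  ok = kaggle_requirement.satisfied_by?(Gem::Version.new(version))
  puts "kaggle #{version}: #{ok ? 'satisfies' : 'violates'} the constraint"
end
```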