UrlCategorise 0.1.2 → 0.1.6
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.claude/settings.local.json +10 -1
- data/.gitignore +1 -0
- data/CLAUDE.md +88 -3
- data/Gemfile +2 -2
- data/Gemfile.lock +18 -9
- data/README.md +517 -4
- data/Rakefile +8 -8
- data/bin/check_lists +12 -13
- data/bin/console +3 -3
- data/bin/export_csv +83 -0
- data/bin/export_hosts +68 -0
- data/bin/rake +2 -0
- data/correct_usage_example.rb +64 -0
- data/docs/v0.1.4-features.md +215 -0
- data/lib/url_categorise/active_record_client.rb +98 -21
- data/lib/url_categorise/client.rb +641 -134
- data/lib/url_categorise/constants.rb +86 -71
- data/lib/url_categorise/dataset_processor.rb +476 -0
- data/lib/url_categorise/iab_compliance.rb +147 -0
- data/lib/url_categorise/models.rb +53 -14
- data/lib/url_categorise/version.rb +1 -1
- data/lib/url_categorise.rb +3 -0
- data/url_categorise.gemspec +37 -33
- metadata +142 -52
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: c1271c16022cb3fb5e9efe9af22deca404204ee54e1b23243c4ebd862565c76b
+data.tar.gz: '082814ad46484b87027e0a1d0042f3f37ab20a3d491f4a8e39a585e14fb40ac8'
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 8f74b667ba1d993237e224d3db7fcb0323816716961280e5fd27fbf162ecb162f83fd126af89ce812e98fbf8fe48a15c47697b656e345d23465a23ab9913581e
+data.tar.gz: 57ed94a3e9a9be6db60c3d48dd9d4ac4da52836951b06a9e6937c44968e5509cda43193491c9ec20039649e01d5bb4a6214f769aab2d6c27c8d047d7b9bd5273
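The SHA256 and SHA512 entries in checksums.yaml are hex digests of the two archives packed inside the .gem file. As a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the gem archive to local paths, Ruby's standard `digest` library can recompute them for comparison. The helper names `archive_digests` and `checksum_matches?` are illustrative assumptions, not part of the gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: returns the hex digests of one extracted archive file.
def archive_digests(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compares a recomputed digest with the value
# recorded in checksums.yaml (comparison is case-insensitive).
def checksum_matches?(path, algorithm, expected_hex)
  archive_digests(path).fetch(algorithm) == expected_hex.downcase
end
```

Calling `checksum_matches?('data.tar.gz', 'SHA256', recorded_value)` with the value from the diff above would confirm that a locally extracted archive corresponds to the published 0.1.6 release.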
data/.claude/settings.local.json
CHANGED
@@ -6,7 +6,16 @@
 "Bash(ruby:*)",
 "Bash(bundle exec ruby:*)",
 "Bash(find:*)",
-"Bash(grep:*)"
+"Bash(grep:*)",
+"Bash(for file in test/url_categorise/*dataset*test.rb)",
+"Bash(do echo \"Checking $file...\")",
+"Bash(done)",
+"Bash(bundle exec rubocop:*)",
+"Bash(rubocop:*)",
+"Bash(timeout 30 ruby dataset_loading_example.rb)",
+"Bash(timeout:*)",
+"Bash(DEBUG=1 timeout 300 ruby correct_usage_example.rb)",
+"Bash(chmod:*)"
 ],
 "deny": []
 }
data/.gitignore
CHANGED
data/CLAUDE.md
CHANGED
@@ -78,12 +78,95 @@ The gem includes automatic monitoring and cleanup of broken URLs:
 - ActiveRecord/Rails integration (optional)
 - URL health monitoring and reporting
 - Automatic cleanup of broken blocklist sources
+- **Dataset Processing**: Kaggle and CSV dataset integration with three auth methods
+- **Optional Kaggle**: Can disable Kaggle functionality entirely while keeping CSV processing
+- **Smart Caching**: Cached datasets work without credentials, avoiding unnecessary authentication
+- **Data Hashing**: SHA256 content hashing for dataset change detection
+- **Category Mapping**: Flexible column detection and category mapping for datasets
+- **Credential Warnings**: Helpful warnings when Kaggle credentials are missing but functionality continues
+- **IAB Compliance**: Full support for IAB Content Taxonomy v2.0 and v3.0 standards
+- **Dataset-Specific Metrics**: Separate counting methods for dataset vs DNS list categorization
+- **Enhanced Statistics**: Extended helper methods for comprehensive data insights
+- **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
+- **Data Export**: Multiple export formats including hosts files per category and CSV data exports
+- **CLI Commands**: Command-line utilities for data export and list checking
 
 ### Architecture
-- `Client` class: Main interface for categorization
+- `Client` class: Main interface for categorization with IAB compliance support and ActiveAttr attributes
+- `DatasetProcessor` class: Handles Kaggle and CSV dataset processing
+- `IabCompliance` module: Maps categories to IAB Content Taxonomy v2.0/v3.0 standards
 - `Constants` module: Contains default list URLs and categories
--
--
+- `ActiveRecordClient` class: Database-backed client with dataset history
+- Modular design allows extending with new list sources and datasets
+- Support for custom list directories, caching, dataset integration, IAB compliance, and data export
+- ActiveAttr integration for dynamic setting modification and attribute validation
+
+### New Features (Latest Version)
+
+#### Dynamic Settings with ActiveAttr
+The Client class now uses ActiveAttr to provide dynamic attribute modification:
+
+```ruby
+client = UrlCategorise::Client.new
+
+# Modify settings in-memory
+client.smart_categorization_enabled = true
+client.iab_compliance_enabled = true
+client.iab_version = :v2
+client.request_timeout = 30
+client.dns_servers = ['8.8.8.8', '8.8.4.4']
+
+# Settings take effect immediately
+categories = client.categorise('reddit.com') # Uses new smart categorization rules
+```
+
+#### Data Export Features
+
+##### Hosts File Export
+Export all categorized domains as separate hosts files per category:
+
+```ruby
+# Export to default location (cache_dir/exports/hosts or ./exports/hosts)
+result = client.export_hosts_files
+
+# Export to custom location
+result = client.export_hosts_files('/custom/export/path')
+
+# Returns hash with file information:
+# {
+# malware: { path: '/path/malware.hosts', filename: 'malware.hosts', count: 1500 },
+# advertising: { path: '/path/advertising.hosts', filename: 'advertising.hosts', count: 25000 },
+# _summary: { total_categories: 15, total_domains: 50000, export_directory: '/path' }
+# }
+```
+
+##### CSV Data Export
+Export all data as a single CSV file for AI training and analysis:
+
+```ruby
+# Export to default location (cache_dir/exports/csv or ./exports/csv)
+result = client.export_csv_data
+
+# Export to custom location
+result = client.export_csv_data('/custom/export/path')
+
+# CSV includes: domain, category, source_type, is_dataset_category, iab_category_v2, iab_category_v3, export_timestamp
+# Metadata file includes: export info, client settings, data summary, dataset metadata
+```
+
+#### CLI Commands
+New command-line utilities for data export:
+
+```bash
+# Export hosts files
+$ bundle exec export_hosts --output /tmp/hosts --verbose
+
+# Export CSV data with IAB compliance
+$ bundle exec export_csv --output /tmp/csv --iab-compliance --verbose
+
+# Check URL health (existing)
+$ bundle exec check_lists
+```
 
 ### List Sources
 Primary sources include:
@@ -91,6 +174,8 @@ Primary sources include:
 - hagezi/dns-blocklists
 - StevenBlack/hosts
 - Various specialized security lists
+- **Kaggle datasets**: Public URL classification datasets
+- **Custom CSV files**: Direct CSV dataset URLs with flexible column mapping
 
 ### Testing Guidelines
 - Mock all HTTP requests using WebMock
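The hosts-file export documented in the CLAUDE.md changes writes one file per category and returns a hash describing each file plus a `_summary` entry. The following is a minimal sketch of that idea in plain Ruby, not the gem's implementation. The function name `write_hosts_files`, the `0.0.0.0` sink address, and the input shape (a hash of category symbols to arrays of domains) are assumptions chosen for illustration; only the `.hosts` filenames and the result keys mirror the documented return value.

```ruby
require 'fileutils'

# Hypothetical sketch: writes one "<category>.hosts" file per category.
# `hosts` maps category symbols to arrays of domain strings (assumed shape).
def write_hosts_files(hosts, export_dir)
  FileUtils.mkdir_p(export_dir)
  result = {}

  hosts.each do |category, domains|
    unique = domains.uniq.sort
    next if unique.empty? # skip categories with nothing to export

    filename = "#{category}.hosts"
    path = File.join(export_dir, filename)
    # The 0.0.0.0 sink address is an assumption; the gem's format may differ.
    File.write(path, unique.map { |domain| "0.0.0.0 #{domain}" }.join("\n") + "\n")
    result[category] = { path: path, filename: filename, count: unique.size }
  end

  # The summary is computed before it is stored, so it counts categories only.
  result[:_summary] = {
    total_categories: result.size,
    total_domains: result.values.sum { |info| info[:count] },
    export_directory: export_dir
  }
  result
end
```

Deduplicating and sorting before writing keeps the output stable between runs, which makes the exported files easier to diff.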
data/Gemfile
CHANGED
@@ -1,6 +1,6 @@
-source
+source 'https://rubygems.org'
 
-git_source(:github) {|
+git_source(:github) { |_repo_name| 'https://github.com/TRex22/url_categorise' }
 
 # Specify your gem's dependencies in url_categorise.gemspec
 gemspec
data/Gemfile.lock
CHANGED
@@ -1,14 +1,18 @@
 PATH
 remote: .
 specs:
-UrlCategorise (0.1.
+UrlCategorise (0.1.6)
+active_attr (>= 0.17.1, < 1.0)
 api_pattern (>= 0.0.6, < 1.0)
 csv (>= 3.3.0, < 4.0)
 digest (>= 3.1.0, < 4.0)
 fileutils (>= 1.7.0, < 2.0)
 httparty (>= 0.22.0, < 1.0)
+json (>= 2.7.0, < 3.0)
+kaggle (>= 0.0.3, < 1.0)
 nokogiri (>= 1.18.9, < 2.0)
 resolv (>= 0.4.0, < 1.0)
+rubyzip (>= 2.3.0, < 3.0)
 
 GEM
 remote: https://rubygems.org/
@@ -78,19 +82,25 @@ GEM
 erubi (1.13.1)
 fileutils (1.7.3)
 hashdiff (1.2.0)
-httparty (0.
+httparty (0.23.1)
 csv
 mini_mime (>= 1.0.0)
 multi_xml (>= 0.5.2)
 i18n (1.14.7)
 concurrent-ruby (~> 1.0)
+json (2.13.2)
+kaggle (0.0.3)
+csv (>= 3.3)
+fileutils (>= 1.7)
+httparty (>= 0.23)
+oj (= 3.16.11)
+rubyzip (>= 2.0)
 logger (1.7.0)
 loofah (2.24.1)
 crass (~> 1.0.2)
 nokogiri (>= 1.12.0)
 method_source (1.1.0)
 mini_mime (1.1.5)
-mini_portile2 (2.8.9)
 minitest (5.25.5)
 minitest-focus (1.4.0)
 minitest (>= 4, < 6)
@@ -103,11 +113,12 @@ GEM
 ruby2_keywords (>= 0.0.5)
 multi_xml (0.7.2)
 bigdecimal (~> 3.1)
-nokogiri (1.18.9)
-mini_portile2 (~> 2.8.2)
-racc (~> 1.4)
 nokogiri (1.18.9-arm64-darwin)
 racc (~> 1.4)
+oj (3.16.11)
+bigdecimal (>= 3.0)
+ostruct (>= 0.2)
+ostruct (0.6.3)
 pry (0.15.2)
 coderay (~> 1.1)
 method_source (~> 1.0)
@@ -130,6 +141,7 @@ GEM
 rexml (3.4.1)
 ruby-progressbar (1.13.0)
 ruby2_keywords (0.0.5)
+rubyzip (2.4.1)
 securerandom (0.4.1)
 simplecov (0.22.0)
 docile (~> 1.1)
@@ -137,8 +149,6 @@ GEM
 simplecov_json_formatter (~> 0.1)
 simplecov-html (0.13.2)
 simplecov_json_formatter (0.1.4)
-sqlite3 (2.7.3)
-mini_portile2 (~> 2.8.0)
 sqlite3 (2.7.3-arm64-darwin)
 timecop (0.9.10)
 timeout (0.4.3)
@@ -153,7 +163,6 @@ GEM
 
 PLATFORMS
 arm64-darwin-24
-ruby
 
 DEPENDENCIES
 UrlCategorise!
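The version ranges listed under PATH in Gemfile.lock, such as `httparty (>= 0.22.0, < 1.0)`, are constraints that the resolved versions under GEM must satisfy. As a small sketch, RubyGems' own `Gem::Requirement` and `Gem::Version` classes can check a constraint against a resolved version. The helper name `satisfies?` is an illustrative assumption, and the constraint strings are copied from the lockfile diff above.

```ruby
require 'rubygems'

# Hypothetical helper: does a resolved version satisfy a lockfile-style
# constraint string such as ">= 0.22.0, < 1.0"?
def satisfies?(constraint, version)
  requirement = Gem::Requirement.new(*constraint.split(',').map(&:strip))
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts satisfies?('>= 0.22.0, < 1.0', '0.23.1') # httparty as resolved in this lockfile
puts satisfies?('>= 2.3.0, < 3.0', '2.4.1')   # rubyzip as resolved in this lockfile
```

Splitting on commas lets a compound range be passed as separate requirement strings, which is the form `Gem::Requirement.new` accepts.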