UrlCategorise 0.1.2 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 74ab2a721a954a91958dc1c184fd76a4a2c3acd250730bb7db58c30286579fcd
- data.tar.gz: 3b01bf42b654266dddcd85d684c2f3e25d455d0fe9b82094f07de2b95c026bd3
+ metadata.gz: c1271c16022cb3fb5e9efe9af22deca404204ee54e1b23243c4ebd862565c76b
+ data.tar.gz: '082814ad46484b87027e0a1d0042f3f37ab20a3d491f4a8e39a585e14fb40ac8'
  SHA512:
- metadata.gz: 2d175a1f72e2fda10771c770a694f07235cd7f5a3f93d1b487d82c98b4f7b490f7aaf665834dcf226bc1a6ae2cc833038731e8d9581af0a570302a7c17dad1c1
- data.tar.gz: 2bd46865492bea3411b8506f294cf96b507c5c5beafad2584b2438977043fbfd3850e79c0a0342b11079829ef5bf3fd91c18c3f45ad5e0537fae8c78c9645245
+ metadata.gz: 8f74b667ba1d993237e224d3db7fcb0323816716961280e5fd27fbf162ecb162f83fd126af89ce812e98fbf8fe48a15c47697b656e345d23465a23ab9913581e
+ data.tar.gz: 57ed94a3e9a9be6db60c3d48dd9d4ac4da52836951b06a9e6937c44968e5509cda43193491c9ec20039649e01d5bb4a6214f769aab2d6c27c8d047d7b9bd5273
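The digests in checksums.yaml can be compared against a locally unpacked gem. A minimal sketch of that comparison follows, assuming the `.gem` archive has already been extracted so that `data.tar.gz` is on disk. The helper name `digest_matches?` and the file path are hypothetical; only Ruby's standard `digest` library is used.

```ruby
require 'digest'

# Hypothetical helper: true when the file at +path+ hashes to +expected_hex+
# under the chosen algorithm (:sha256 or :sha512).
def digest_matches?(path, expected_hex, algorithm: :sha256)
  digest_class = { sha256: Digest::SHA256, sha512: Digest::SHA512 }.fetch(algorithm)
  # Stream the file through the digest and compare hex strings case-insensitively.
  digest_class.file(path).hexdigest == expected_hex.downcase
end

# Illustrative call; the path is an assumption, the digest is the 0.1.6
# SHA256 value listed for data.tar.gz:
# digest_matches?('data.tar.gz',
#                 '082814ad46484b87027e0a1d0042f3f37ab20a3d491f4a8e39a585e14fb40ac8')
```

Passing `algorithm: :sha512` checks the longer digest in the same way.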
@@ -6,7 +6,16 @@
  "Bash(ruby:*)",
  "Bash(bundle exec ruby:*)",
  "Bash(find:*)",
- "Bash(grep:*)"
+ "Bash(grep:*)",
+ "Bash(for file in test/url_categorise/*dataset*test.rb)",
+ "Bash(do echo \"Checking $file...\")",
+ "Bash(done)",
+ "Bash(bundle exec rubocop:*)",
+ "Bash(rubocop:*)",
+ "Bash(timeout 30 ruby dataset_loading_example.rb)",
+ "Bash(timeout:*)",
+ "Bash(DEBUG=1 timeout 300 ruby correct_usage_example.rb)",
+ "Bash(chmod:*)"
  ],
  "deny": []
  }
data/.gitignore CHANGED
@@ -54,3 +54,4 @@ build-iPhoneSimulator/

  # Used by RuboCop. Remote config files pulled in from inherit_from directive.
  # .rubocop-https?--*
+ url_cache/*
data/CLAUDE.md CHANGED
@@ -78,12 +78,95 @@ The gem includes automatic monitoring and cleanup of broken URLs:
  - ActiveRecord/Rails integration (optional)
  - URL health monitoring and reporting
  - Automatic cleanup of broken blocklist sources
+ - **Dataset Processing**: Kaggle and CSV dataset integration with three auth methods
+ - **Optional Kaggle**: Can disable Kaggle functionality entirely while keeping CSV processing
+ - **Smart Caching**: Cached datasets work without credentials, avoiding unnecessary authentication
+ - **Data Hashing**: SHA256 content hashing for dataset change detection
+ - **Category Mapping**: Flexible column detection and category mapping for datasets
+ - **Credential Warnings**: Helpful warnings when Kaggle credentials are missing but functionality continues
+ - **IAB Compliance**: Full support for IAB Content Taxonomy v2.0 and v3.0 standards
+ - **Dataset-Specific Metrics**: Separate counting methods for dataset vs DNS list categorization
+ - **Enhanced Statistics**: Extended helper methods for comprehensive data insights
+ - **ActiveAttr Settings**: In-memory modification of client settings using attribute setters
+ - **Data Export**: Multiple export formats including hosts files per category and CSV data exports
+ - **CLI Commands**: Command-line utilities for data export and list checking

  ### Architecture
- - `Client` class: Main interface for categorization
+ - `Client` class: Main interface for categorization with IAB compliance support and ActiveAttr attributes
+ - `DatasetProcessor` class: Handles Kaggle and CSV dataset processing
+ - `IabCompliance` module: Maps categories to IAB Content Taxonomy v2.0/v3.0 standards
  - `Constants` module: Contains default list URLs and categories
- - Modular design allows extending with new list sources
- - Support for custom list directories and caching
+ - `ActiveRecordClient` class: Database-backed client with dataset history
+ - Modular design allows extending with new list sources and datasets
+ - Support for custom list directories, caching, dataset integration, IAB compliance, and data export
+ - ActiveAttr integration for dynamic setting modification and attribute validation
+
+ ### New Features (Latest Version)
+
+ #### Dynamic Settings with ActiveAttr
+ The Client class now uses ActiveAttr to provide dynamic attribute modification:
+
+ ```ruby
+ client = UrlCategorise::Client.new
+
+ # Modify settings in-memory
+ client.smart_categorization_enabled = true
+ client.iab_compliance_enabled = true
+ client.iab_version = :v2
+ client.request_timeout = 30
+ client.dns_servers = ['8.8.8.8', '8.8.4.4']
+
+ # Settings take effect immediately
+ categories = client.categorise('reddit.com') # Uses new smart categorization rules
+ ```
+
+ #### Data Export Features
+
+ ##### Hosts File Export
+ Export all categorized domains as separate hosts files per category:
+
+ ```ruby
+ # Export to default location (cache_dir/exports/hosts or ./exports/hosts)
+ result = client.export_hosts_files
+
+ # Export to custom location
+ result = client.export_hosts_files('/custom/export/path')
+
+ # Returns hash with file information:
+ # {
+ # malware: { path: '/path/malware.hosts', filename: 'malware.hosts', count: 1500 },
+ # advertising: { path: '/path/advertising.hosts', filename: 'advertising.hosts', count: 25000 },
+ # _summary: { total_categories: 15, total_domains: 50000, export_directory: '/path' }
+ # }
+ ```
+
+ ##### CSV Data Export
+ Export all data as a single CSV file for AI training and analysis:
+
+ ```ruby
+ # Export to default location (cache_dir/exports/csv or ./exports/csv)
+ result = client.export_csv_data
+
+ # Export to custom location
+ result = client.export_csv_data('/custom/export/path')
+
+ # CSV includes: domain, category, source_type, is_dataset_category, iab_category_v2, iab_category_v3, export_timestamp
+ # Metadata file includes: export info, client settings, data summary, dataset metadata
+ ```
+
+ #### CLI Commands
+ New command-line utilities for data export:
+
+ ```bash
+ # Export hosts files
+ $ bundle exec export_hosts --output /tmp/hosts --verbose
+
+ # Export CSV data with IAB compliance
+ $ bundle exec export_csv --output /tmp/csv --iab-compliance --verbose
+
+ # Check URL health (existing)
+ $ bundle exec check_lists
+ ```

  ### List Sources
  Primary sources include:
@@ -91,6 +174,8 @@ Primary sources include:
  - hagezi/dns-blocklists
  - StevenBlack/hosts
  - Various specialized security lists
+ - **Kaggle datasets**: Public URL classification datasets
+ - **Custom CSV files**: Direct CSV dataset URLs with flexible column mapping

  ### Testing Guidelines
  - Mock all HTTP requests using WebMock
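The CLAUDE.md changes above document `export_hosts_files` as writing one hosts file per category and returning a hash with `path`, `filename`, and `count` per category plus a `_summary` entry. A minimal sketch of that return shape follows. It is not the gem's implementation: `write_hosts_files` is a hypothetical stand-in, the input is assumed to be a plain `{ category => [domains] }` hash, and the `0.0.0.0` sinkhole address is an assumption.

```ruby
require 'fileutils'

# Hypothetical stand-in for the documented export: writes one
# "<category>.hosts" file per category and returns per-file details.
# The 0.0.0.0 sinkhole address is an assumption, not taken from the gem.
def write_hosts_files(domains_by_category, export_dir)
  FileUtils.mkdir_p(export_dir)
  result = {}

  domains_by_category.each do |category, domains|
    unique = domains.uniq.sort
    filename = "#{category}.hosts"
    path = File.join(export_dir, filename)
    # One "address domain" pair per line, as in a conventional hosts file.
    File.write(path, unique.map { |domain| "0.0.0.0 #{domain}\n" }.join)
    result[category] = { path: path, filename: filename, count: unique.size }
  end

  # Summary counts each domain once per category it appears in.
  result[:_summary] = {
    total_categories: domains_by_category.size,
    total_domains: result.values.sum { |info| info[:count] },
    export_directory: export_dir
  }
  result
end
```

In this sketch a domain listed under two categories is written to both files and counted twice in `total_domains`; whether the gem deduplicates across categories is not stated in the diff.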
data/Gemfile CHANGED
@@ -1,6 +1,6 @@
- source "https://rubygems.org"
+ source 'https://rubygems.org'

- git_source(:github) {|repo_name| "https://github.com/TRex22/url_categorise" }
+ git_source(:github) { |_repo_name| 'https://github.com/TRex22/url_categorise' }

  # Specify your gem's dependencies in url_categorise.gemspec
  gemspec
data/Gemfile.lock CHANGED
@@ -1,14 +1,18 @@
  PATH
  remote: .
  specs:
- UrlCategorise (0.1.2)
+ UrlCategorise (0.1.6)
+ active_attr (>= 0.17.1, < 1.0)
  api_pattern (>= 0.0.6, < 1.0)
  csv (>= 3.3.0, < 4.0)
  digest (>= 3.1.0, < 4.0)
  fileutils (>= 1.7.0, < 2.0)
  httparty (>= 0.22.0, < 1.0)
+ json (>= 2.7.0, < 3.0)
+ kaggle (>= 0.0.3, < 1.0)
  nokogiri (>= 1.18.9, < 2.0)
  resolv (>= 0.4.0, < 1.0)
+ rubyzip (>= 2.3.0, < 3.0)

  GEM
  remote: https://rubygems.org/
@@ -78,19 +82,25 @@ GEM
  erubi (1.13.1)
  fileutils (1.7.3)
  hashdiff (1.2.0)
- httparty (0.22.0)
+ httparty (0.23.1)
  csv
  mini_mime (>= 1.0.0)
  multi_xml (>= 0.5.2)
  i18n (1.14.7)
  concurrent-ruby (~> 1.0)
+ json (2.13.2)
+ kaggle (0.0.3)
+ csv (>= 3.3)
+ fileutils (>= 1.7)
+ httparty (>= 0.23)
+ oj (= 3.16.11)
+ rubyzip (>= 2.0)
  logger (1.7.0)
  loofah (2.24.1)
  crass (~> 1.0.2)
  nokogiri (>= 1.12.0)
  method_source (1.1.0)
  mini_mime (1.1.5)
- mini_portile2 (2.8.9)
  minitest (5.25.5)
  minitest-focus (1.4.0)
  minitest (>= 4, < 6)
@@ -103,11 +113,12 @@ GEM
  ruby2_keywords (>= 0.0.5)
  multi_xml (0.7.2)
  bigdecimal (~> 3.1)
- nokogiri (1.18.9)
- mini_portile2 (~> 2.8.2)
- racc (~> 1.4)
  nokogiri (1.18.9-arm64-darwin)
  racc (~> 1.4)
+ oj (3.16.11)
+ bigdecimal (>= 3.0)
+ ostruct (>= 0.2)
+ ostruct (0.6.3)
  pry (0.15.2)
  coderay (~> 1.1)
  method_source (~> 1.0)
@@ -130,6 +141,7 @@ GEM
  rexml (3.4.1)
  ruby-progressbar (1.13.0)
  ruby2_keywords (0.0.5)
+ rubyzip (2.4.1)
  securerandom (0.4.1)
  simplecov (0.22.0)
  docile (~> 1.1)
@@ -137,8 +149,6 @@ GEM
  simplecov_json_formatter (~> 0.1)
  simplecov-html (0.13.2)
  simplecov_json_formatter (0.1.4)
- sqlite3 (2.7.3)
- mini_portile2 (~> 2.8.0)
  sqlite3 (2.7.3-arm64-darwin)
  timecop (0.9.10)
  timeout (0.4.3)
@@ -153,7 +163,6 @@ GEM

  PLATFORMS
  arm64-darwin-24
- ruby

  DEPENDENCIES
  UrlCategorise!