UrlCategorise 0.1.6 → 0.1.9

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c1271c16022cb3fb5e9efe9af22deca404204ee54e1b23243c4ebd862565c76b
-  data.tar.gz: '082814ad46484b87027e0a1d0042f3f37ab20a3d491f4a8e39a585e14fb40ac8'
+  metadata.gz: c22ada722efc33979930ece5d8737625943f1d815d2c7acdbfbc0f5b0fede15f
+  data.tar.gz: 15747c8a4b23c4c805cd74118627dfa8d102d4d813ae9f08c8c49011ff2ebad8
 SHA512:
-  metadata.gz: 8f74b667ba1d993237e224d3db7fcb0323816716961280e5fd27fbf162ecb162f83fd126af89ce812e98fbf8fe48a15c47697b656e345d23465a23ab9913581e
-  data.tar.gz: 57ed94a3e9a9be6db60c3d48dd9d4ac4da52836951b06a9e6937c44968e5509cda43193491c9ec20039649e01d5bb4a6214f769aab2d6c27c8d047d7b9bd5273
+  metadata.gz: b1c21b59b8a7631c4c39f206fb37bb9c51c1925a544ddd5f8a627b595dafa92d366d880a33c6303eafc477ae9be4130055af060953e27617e5ce198618e5f9bd
+  data.tar.gz: 1621c52578daf8957546c778125df120f9480dbe2f2d365feb6024890ddfe5affac3aae77100047ea46b15c96939db1fc228b876806197c63dda954199536bc9
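The digests recorded in `checksums.yaml` can be compared against the `metadata.gz` and `data.tar.gz` files unpacked from a `.gem` archive. The following is a minimal sketch, assuming the archive has already been extracted into a single directory; `verify_checksums` is a hypothetical helper written for illustration and is not part of RubyGems or of this gem.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: compares every digest listed in a checksums.yaml
# document with the digest of the matching file inside `dir`.
# Returns a hash such as { "SHA256/metadata.gz" => true }.
def verify_checksums(checksums_yaml, dir)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  expected = YAML.safe_load(checksums_yaml)

  expected.each_with_object({}) do |(algorithm, files), results|
    digest_class = algorithms.fetch(algorithm)
    files.each do |name, hex|
      # Digest::SHA256.file streams the file rather than loading it whole.
      actual = digest_class.file(File.join(dir, name)).hexdigest
      results["#{algorithm}/#{name}"] = (actual == hex.to_s)
    end
  end
end
```

Any entry that maps to `false` means the unpacked file does not match the digest published for that version.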
@@ -15,8 +15,16 @@
       "Bash(timeout 30 ruby dataset_loading_example.rb)",
       "Bash(timeout:*)",
       "Bash(DEBUG=1 timeout 300 ruby correct_usage_example.rb)",
-      "Bash(chmod:*)"
+      "Bash(chmod:*)",
+      "Bash(bundle exec bin/export_csv:*)",
+      "Bash(bundle exec rg:*)",
+      "WebFetch(domain:github.com)",
+      "WebFetch(domain:api.github.com)",
+      "WebFetch(domain:raw.githubusercontent.com)",
+      "WebSearch",
+      "WebFetch(domain:firebog.net)",
+      "Bash(curl:*)"
     ],
     "deny": []
   }
-}
+}
data/.gitignore CHANGED
@@ -55,3 +55,5 @@ build-iPhoneSimulator/
 # Used by RuboCop. Remote config files pulled in from inherit_from directive.
 # .rubocop-https?--*
 url_cache/*
+exports/*
+.DS_Store
data/.rubocop.yml ADDED
@@ -0,0 +1,56 @@
+inherit_gem:
+  rubocop-rails-omakase: rubocop.yml
+
+AllCops:
+  TargetRubyVersion: 3.0
+  NewCops: enable
+  Exclude:
+    - 'bin/*'
+    - 'vendor/**/*'
+    - 'tmp/**/*'
+    - 'db/migrate/*'
+
+# Allow longer lines for URL constants
+Layout/LineLength:
+  Max: 200
+  Exclude:
+    - 'lib/url_categorise/constants.rb'
+    - 'test/**/*'
+
+# Allow complex methods in client due to categorization logic
+Metrics/MethodLength:
+  Max: 30
+  Exclude:
+    - 'lib/url_categorise/client.rb'
+    - 'test/**/*'
+
+# Allow complex classes for main client
+Metrics/ClassLength:
+  Max: 500
+  Exclude:
+    - 'lib/url_categorise/client.rb'
+    - 'test/**/*'
+
+# Allow higher complexity for categorization methods
+Metrics/CyclomaticComplexity:
+  Max: 15
+  Exclude:
+    - 'lib/url_categorise/client.rb'
+
+# Allow higher ABC size for complex categorization logic
+Metrics/AbcSize:
+  Max: 25
+  Exclude:
+    - 'lib/url_categorise/client.rb'
+    - 'test/**/*'
+
+# Allow higher parameter count for client initialization
+Metrics/ParameterLists:
+  Max: 8
+
+# Allow block length for tests and constants
+Metrics/BlockLength:
+  Exclude:
+    - 'test/**/*'
+    - 'lib/url_categorise/constants.rb'
+    - 'url_categorise.gemspec'
data/.sublime-project ADDED
@@ -0,0 +1,14 @@
+{
+  "folders": [
+    {
+      "path": ".",
+      "folder_exclude_patterns": [
+        "node_modules", ".git", "tmp", "log",
+        "vendor/bundle", "coverage", ".bundle"
+      ],
+      "file_exclude_patterns": [
+        ".log", ".min.js", "*.min.css"
+      ]
+    }
+  ]
+}
data/CLAUDE.md CHANGED
@@ -103,6 +103,34 @@ The gem includes automatic monitoring and cleanup of broken URLs:
 
 ### New Features (Latest Version)
 
+#### Video Hosting Detection and Regex Categorization
+Advanced video content detection system with:
+
+- **Comprehensive Video Hosting Lists**: Generate PiHole-compatible hosts files from yt-dlp extractors
+- **Regex-Based Content Detection**: Distinguish between video content URLs vs homepages/profiles/playlists
+- **Direct Video URL Detection**: `video_url?` method to check if URLs are direct video content links
+- **Automatic List Generation**: `bin/generate_video_lists` script fetches and processes yt-dlp data
+- **Video Hosting Category**: Separate `video_hosting` category with 3,500+ domains
+- **Smart Content Categorization**: URLs matching video patterns get `*_content` suffix categories
+- **Remote Pattern Files**: Automatically downloads video patterns from GitHub repository
+
+```ruby
+# Enable regex categorization for video content detection (uses remote patterns by default)
+client = UrlCategorise::Client.new(regex_categorization: true)
+
+# Basic domain categorization
+client.categorise('https://youtube.com') # => [:video_hosting]
+
+# Enhanced content detection
+client.categorise('https://youtube.com/watch?v=abc123') # => [:video_hosting, :video_hosting_content]
+
+# Direct video URL detection
+client.video_url?('https://youtube.com/watch?v=abc123') # => true
+client.video_url?('https://youtube.com') # => false
+client.video_url?('https://vimeo.com/123456789') # => true
+client.video_url?('https://tiktok.com/@user/video/123') # => true
+```
+
 #### Dynamic Settings with ActiveAttr
 The Client class now uses ActiveAttr to provide dynamic attribute modification:
 
@@ -141,7 +169,7 @@ result = client.export_hosts_files('/custom/export/path')
 ```
 
 ##### CSV Data Export
-Export all data as a single CSV file for AI training and analysis:
+Export all data as a single comprehensive CSV file for AI training and analysis:
 
 ```ruby
 # Export to default location (cache_dir/exports/csv or ./exports/csv)
@@ -150,33 +178,68 @@ result = client.export_csv_data
 # Export to custom location
 result = client.export_csv_data('/custom/export/path')
 
-# CSV includes: domain, category, source_type, is_dataset_category, iab_category_v2, iab_category_v3, export_timestamp
-# Metadata file includes: export info, client settings, data summary, dataset metadata
+# Returns:
+# {
+# csv_file: 'url_categorise_comprehensive_export_TIMESTAMP.csv',
+# summary_file: 'export_summary_TIMESTAMP.json',
+# total_entries: 75000,
+# summary: { domain_categorization_entries: 50000, dataset_content_entries: 25000 },
+# export_directory: '/export/path'
+# }
 ```
 
+**Comprehensive Export Features:**
+- **Everything in One File**: Combined domains, categories, and raw dataset content
+- **Rich Dataset Content**: Original titles, descriptions, summaries, and text from datasets
+- **Dynamic Headers**: Automatically includes all available fields from any dataset
+- **Data Type Tracking**: Distinguishes between processed domains and raw dataset entries
+- **Perfect for AI/ML**: Single file with both structured categorization and rich textual features
+
 #### CLI Commands
-New command-line utilities for data export:
+Command-line utilities for comprehensive data export:
 
 ```bash
-# Export hosts files
+# Export hosts files per category
 $ bundle exec export_hosts --output /tmp/hosts --verbose
 
-# Export CSV data with IAB compliance
-$ bundle exec export_csv --output /tmp/csv --iab-compliance --verbose
+# Full CSV export with datasets and all features
+$ bundle exec export_csv --auto-load-datasets --iab-compliance --smart-categorization --verbose
+
+# Custom configuration export
+$ bundle exec export_csv --cache-dir ./custom_cache --kaggle-credentials ~/kaggle.json --output /tmp/export
 
-# Check URL health (existing)
+# Basic domain categorization only
+$ bundle exec export_csv --output /tmp/basic
+
+# Health check for all blocklist URLs
 $ bundle exec check_lists
+
+# Generate updated video hosting lists
+$ ruby bin/generate_video_lists
 ```
 
+**Enhanced CLI Features:**
+- `--auto-load-datasets`: Automatically load datasets from constants for rich content export
+- `--kaggle-credentials FILE`: Custom Kaggle API credentials file path
+- Full integration with all client features (IAB compliance, smart categorization, etc.)
+- Verbose output shows dataset statistics and loading progress
+- Video list generation from yt-dlp extractors with manual curation
+
 ### List Sources
 Primary sources include:
 - The Block List Project
 - hagezi/dns-blocklists
 - StevenBlack/hosts
 - Various specialized security lists
+- **yt-dlp video extractors**: Comprehensive video hosting domain detection (3,500+ domains)
+- **GitHub-hosted video patterns**: Remote video URL detection patterns with manual curation
 - **Kaggle datasets**: Public URL classification datasets
 - **Custom CSV files**: Direct CSV dataset URLs with flexible column mapping
 
+**Video hosting lists are now automatically fetched from:**
+- Video domains: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_hosting_domains.hosts`
+- URL patterns: `https://raw.githubusercontent.com/TRex22/url_categorise/refs/heads/main/lists/video_url_patterns.txt`
+
 ### Testing Guidelines
 - Mock all HTTP requests using WebMock
 - Test both success and failure scenarios
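The Video Hosting Detection section added to CLAUDE.md in this version describes a two-step categorisation: a domain lookup yields `:video_hosting`, and a URL that also matches a video pattern gains a `*_content` category. The following is a rough sketch of that idea only, not the gem's implementation. The `VIDEO_DOMAINS` set, the `VIDEO_PATTERNS` list, and both `sketch_*` methods are hypothetical stand-ins for the remotely fetched lists and the real client methods.

```ruby
require "set"
require "uri"

# Hypothetical stand-ins for the downloaded domain list and pattern file.
VIDEO_DOMAINS = Set["youtube.com", "vimeo.com"].freeze
VIDEO_PATTERNS = [
  %r{\Ahttps?://(?:www\.)?youtube\.com/watch\?v=[\w-]+},
  %r{\Ahttps?://(?:www\.)?vimeo\.com/\d+}
].freeze

# True when the URL matches one of the direct-video patterns.
def sketch_video_url?(url)
  VIDEO_PATTERNS.any? { |pattern| pattern.match?(url) }
end

# Domain lookup first; the *_content category is added only when the
# URL also matches a video pattern.
def sketch_categorise(url)
  host = URI.parse(url).host.to_s.sub(/\Awww\./, "")
  return [] unless VIDEO_DOMAINS.include?(host)

  categories = [:video_hosting]
  categories << :video_hosting_content if sketch_video_url?(url)
  categories
end
```

Keeping the domain check separate from the pattern check means a homepage on a video host is still categorised by domain, while only URLs that look like individual videos receive the extra category.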
data/Gemfile CHANGED
@@ -1,6 +1,6 @@
-source 'https://rubygems.org'
+source "https://rubygems.org"
 
-git_source(:github) { |_repo_name| 'https://github.com/TRex22/url_categorise' }
+git_source(:github) { |_repo_name| "https://github.com/TRex22/url_categorise" }
 
 # Specify your gem's dependencies in url_categorise.gemspec
 gemspec
data/Gemfile.lock CHANGED
@@ -1,16 +1,17 @@
 PATH
   remote: .
   specs:
-    UrlCategorise (0.1.6)
+    UrlCategorise (0.1.9)
       active_attr (>= 0.17.1, < 1.0)
       api_pattern (>= 0.0.6, < 1.0)
       csv (>= 3.3.0, < 4.0)
       digest (>= 3.1.0, < 4.0)
       fileutils (>= 1.7.0, < 2.0)
-      httparty (>= 0.22.0, < 1.0)
+      httparty (>= 0.24.0, < 1.0)
       json (>= 2.7.0, < 3.0)
       kaggle (>= 0.0.3, < 1.0)
-      nokogiri (>= 1.18.9, < 2.0)
+      nokogiri (>= 1.19.1, < 2.0)
+      reline (>= 0.6.2, < 2.0)
       resolv (>= 0.4.0, < 1.0)
       rubyzip (>= 2.3.0, < 3.0)
 
@@ -64,10 +65,14 @@ GEM
       csv (>= 3.3.0)
       httparty (>= 0.22.0)
       nokogiri (>= 1.16.0)
+    ast (2.4.3)
     base64 (0.3.0)
     benchmark (0.4.1)
-    bigdecimal (3.2.2)
+    bigdecimal (4.1.0)
     builder (3.3.0)
+    bundler-audit (0.9.3)
+      bundler (>= 1.2.0)
+      thor (~> 1.0)
     coderay (1.1.3)
     concurrent-ruby (1.3.5)
     connection_pool (2.5.3)
@@ -82,12 +87,13 @@ GEM
     erubi (1.13.1)
     fileutils (1.7.3)
     hashdiff (1.2.0)
-    httparty (0.23.1)
+    httparty (0.24.2)
       csv
       mini_mime (>= 1.0.0)
       multi_xml (>= 0.5.2)
     i18n (1.14.7)
       concurrent-ruby (~> 1.0)
+    io-console (0.8.1)
     json (2.13.2)
     kaggle (0.0.3)
       csv (>= 3.3)
@@ -95,6 +101,8 @@ GEM
       httparty (>= 0.23)
       oj (= 3.16.11)
       rubyzip (>= 2.0)
+    language_server-protocol (3.17.0.5)
+    lint_roller (1.1.0)
     logger (1.7.0)
     loofah (2.24.1)
       crass (~> 1.0.2)
@@ -111,22 +119,28 @@ GEM
       ruby-progressbar
     mocha (2.4.5)
       ruby2_keywords (>= 0.0.5)
-    multi_xml (0.7.2)
-      bigdecimal (~> 3.1)
-    nokogiri (1.18.9-arm64-darwin)
+    multi_xml (0.8.1)
+      bigdecimal (>= 3.1, < 5)
+    nokogiri (1.19.2-arm64-darwin)
       racc (~> 1.4)
     oj (3.16.11)
       bigdecimal (>= 3.0)
       ostruct (>= 0.2)
     ostruct (0.6.3)
+    parallel (1.27.0)
+    parser (3.3.9.0)
+      ast (~> 2.4.1)
+      racc
+    prism (1.4.0)
     pry (0.15.2)
       coderay (~> 1.1)
       method_source (~> 1.0)
     public_suffix (6.0.2)
     racc (1.8.1)
-    rack (2.2.17)
-    rack-session (1.0.2)
-      rack (< 3)
+    rack (3.2.5)
+    rack-session (2.1.1)
+      base64 (>= 0.1.0)
+      rack (>= 3.0.0)
     rack-test (2.2.0)
       rack (>= 1.3)
     rails-dom-testing (2.3.0)
@@ -136,9 +150,41 @@ GEM
     rails-html-sanitizer (1.6.2)
       loofah (~> 2.21)
       nokogiri (>= 1.15.7, != 1.16.7, != 1.16.6, != 1.16.5, != 1.16.4, != 1.16.3, != 1.16.2, != 1.16.1, != 1.16.0.rc1, != 1.16.0)
+    rainbow (3.1.1)
     rake (13.3.0)
+    regexp_parser (2.11.2)
+    reline (0.6.2)
+      io-console (~> 0.5)
     resolv (0.6.2)
-    rexml (3.4.1)
+    rexml (3.4.4)
+    rubocop (1.80.1)
+      json (~> 2.3)
+      language_server-protocol (~> 3.17.0.2)
+      lint_roller (~> 1.1.0)
+      parallel (~> 1.10)
+      parser (>= 3.3.0.2)
+      rainbow (>= 2.2.2, < 4.0)
+      regexp_parser (>= 2.9.3, < 3.0)
+      rubocop-ast (>= 1.46.0, < 2.0)
+      ruby-progressbar (~> 1.7)
+      unicode-display_width (>= 2.4.0, < 4.0)
+    rubocop-ast (1.46.0)
+      parser (>= 3.3.7.2)
+      prism (~> 1.4)
+    rubocop-performance (1.25.0)
+      lint_roller (~> 1.1)
+      rubocop (>= 1.75.0, < 2.0)
+      rubocop-ast (>= 1.38.0, < 2.0)
+    rubocop-rails (2.33.3)
+      activesupport (>= 4.2.0)
+      lint_roller (~> 1.1)
+      rack (>= 1.1)
+      rubocop (>= 1.75.0, < 2.0)
+      rubocop-ast (>= 1.44.0, < 2.0)
+    rubocop-rails-omakase (1.1.0)
+      rubocop (>= 1.72)
+      rubocop-performance (>= 1.24)
+      rubocop-rails (>= 2.30)
     ruby-progressbar (1.13.0)
     ruby2_keywords (0.0.5)
     rubyzip (2.4.1)
@@ -150,11 +196,15 @@ GEM
     simplecov-html (0.13.2)
     simplecov_json_formatter (0.1.4)
     sqlite3 (2.7.3-arm64-darwin)
+    thor (1.5.0)
     timecop (0.9.10)
     timeout (0.4.3)
     tzinfo (2.0.6)
       concurrent-ruby (~> 1.0)
-    uri (1.0.3)
+    unicode-display_width (3.1.5)
+      unicode-emoji (~> 4.0, >= 4.0.4)
+    unicode-emoji (4.0.4)
+    uri (1.1.1)
     useragent (0.16.11)
     webmock (3.24.0)
       addressable (>= 2.8.0)
@@ -163,10 +213,12 @@ GEM
 
 PLATFORMS
   arm64-darwin-24
+  arm64-darwin-25
 
 DEPENDENCIES
   UrlCategorise!
   activerecord (>= 8.0)
+  bundler-audit (~> 0.9)
   logger
   minitest (~> 5.25.5)
   minitest-focus (~> 1.4.0)
@@ -174,6 +226,7 @@ DEPENDENCIES
   mocha (~> 2.4.5)
   pry (~> 0.15.2)
   rake (~> 13.3.0)
+  rubocop-rails-omakase (~> 1.0)
   simplecov (~> 0.22.0)
   sqlite3 (>= 2.7)
   timecop (~> 0.9.10)