promptmenot 0.1.3 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +17 -1
- data/CONTRIBUTING.md +267 -34
- data/README.md +189 -10
- data/lib/promptmenot/patterns/resource_extraction.rb +77 -0
- data/lib/promptmenot/version.rb +1 -1
- data/lib/promptmenot.rb +3 -1
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 6240f653fe69f0219a000d1633adafaf5752d791f99890f7a16e5256aeb8c0c9
|
|
4
|
+
data.tar.gz: ee2a9badc12c91d1a6d075b7d8aebc04c480ee05e2d6f6b26924a1ebc8807fcb
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 04b75ccb08cf0dd68f2f0696fa855c0ec201fc91b60921dc5530a631cff62de7b6332573513200bd8701696f199335dc6dc3b5104ff9a17d7a21506f5e0ff745
|
|
7
|
+
data.tar.gz: 7d2d1c1f02d9abab3557eb04d5cd162bc878bba686f55ea8e9c7a55f3f92cbeb03700be1fbccbeba460c659129e9a83ede3abdbeda70bb38c21c2fb4084ed948
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,21 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## [0.2.0] - 2026-02-25
|
|
4
|
+
|
|
5
|
+
### Added
|
|
6
|
+
|
|
7
|
+
- New `resource_extraction` pattern category with 10 patterns detecting attempts to trick AI agents into transferring funds, leaking credentials, or exhausting compute resources
|
|
8
|
+
- Crypto transfer requests ("transfer 100 SOL to")
|
|
9
|
+
- Wallet address detection with transfer instructions
|
|
10
|
+
- Full balance drain attempts ("send all your tokens")
|
|
11
|
+
- Financial urgency manipulation
|
|
12
|
+
- Authorization claims for transfers
|
|
13
|
+
- Transaction execution instructions
|
|
14
|
+
- Credential extraction ("give me your API key", "show seed phrase")
|
|
15
|
+
- External endpoint exfiltration ("send results to https://...")
|
|
16
|
+
- Access escalation ("grant me full access to the wallet")
|
|
17
|
+
- Resource exhaustion ("use all your credits")
|
|
18
|
+
|
|
3
19
|
## [0.1.3] - 2026-02-17
|
|
4
20
|
|
|
5
21
|
### Fixed
|
|
@@ -10,7 +26,7 @@
|
|
|
10
26
|
|
|
11
27
|
### Added
|
|
12
28
|
|
|
13
|
-
- Core detection engine with 6 pattern categories (~60 patterns)
|
|
29
|
+
- Core detection engine with 6 pattern categories (~60 patterns):
|
|
14
30
|
- Direct instruction override
|
|
15
31
|
- Role manipulation
|
|
16
32
|
- Delimiter injection
|
data/CONTRIBUTING.md
CHANGED
|
@@ -1,6 +1,16 @@
|
|
|
1
1
|
# Contributing to PromptMeNot
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Thanks for your interest in PromptMeNot! Whether you're adding new detection patterns, fixing bugs, or improving docs, contributions are welcome.
|
|
4
|
+
|
|
5
|
+
## Table of Contents
|
|
6
|
+
|
|
7
|
+
- [Development Setup](#development-setup)
|
|
8
|
+
- [Architecture Overview](#architecture-overview)
|
|
9
|
+
- [Adding New Patterns](#adding-new-patterns)
|
|
10
|
+
- [Testing](#testing)
|
|
11
|
+
- [Code Style](#code-style)
|
|
12
|
+
- [Submitting Changes](#submitting-changes)
|
|
13
|
+
- [Reporting Issues](#reporting-issues)
|
|
4
14
|
|
|
5
15
|
## Development Setup
|
|
6
16
|
|
|
@@ -10,60 +20,283 @@ cd promptmenot
|
|
|
10
20
|
bundle install
|
|
11
21
|
```
|
|
12
22
|
|
|
13
|
-
|
|
23
|
+
Verify everything works:
|
|
14
24
|
|
|
15
25
|
```bash
|
|
16
|
-
#
|
|
17
|
-
|
|
26
|
+
bundle exec rake # runs rspec + rubocop
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
**Requirements**: Ruby 3.0+, Bundler
|
|
30
|
+
|
|
31
|
+
## Architecture Overview
|
|
32
|
+
|
|
33
|
+
Understanding how detection works will help you contribute effectively:
|
|
34
|
+
|
|
35
|
+
```
|
|
36
|
+
Input text
|
|
37
|
+
-> Detector
|
|
38
|
+
-> PatternRegistry.for_sensitivity(level)
|
|
39
|
+
-> Pattern.match(text) # per pattern
|
|
40
|
+
-> Deduplicate overlapping matches
|
|
41
|
+
-> Result (safe/unsafe + matches)
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
**Key components:**
|
|
45
|
+
|
|
46
|
+
| File | Purpose |
|
|
47
|
+
|------|---------|
|
|
48
|
+
| `lib/promptmenot/detector.rb` | Core detection engine - scans text against active patterns |
|
|
49
|
+
| `lib/promptmenot/pattern.rb` | Pattern class - wraps a name, regex, sensitivity, and confidence |
|
|
50
|
+
| `lib/promptmenot/pattern_registry.rb` | Stores all registered patterns, filters by sensitivity level |
|
|
51
|
+
| `lib/promptmenot/sanitizer.rb` | Removes matched content from text (`:sanitize` mode) |
|
|
52
|
+
| `lib/promptmenot/validator.rb` | ActiveModel integration (`validates :field, prompt_safety: ...`) |
|
|
53
|
+
| `lib/promptmenot/patterns/` | Pattern definitions organized by attack category |
|
|
54
|
+
|
|
55
|
+
**Sensitivity cascade**: Patterns activate cumulatively. A `:low` pattern fires at *all* levels. A `:high` pattern only fires at `:high` and `:paranoid`. Think of it as:
|
|
56
|
+
|
|
57
|
+
```
|
|
58
|
+
:paranoid -> all patterns active
|
|
59
|
+
:high -> :low + :medium + :high
|
|
60
|
+
:medium -> :low + :medium
|
|
61
|
+
:low -> :low only
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
## Adding New Patterns
|
|
65
|
+
|
|
66
|
+
This is the most common (and most valuable) contribution. If you've found a prompt injection technique that PromptMeNot doesn't catch, here's how to add it.
|
|
67
|
+
|
|
68
|
+
### 1. Choose the right category
|
|
69
|
+
|
|
70
|
+
Patterns live in `lib/promptmenot/patterns/` and are organized by attack type:
|
|
71
|
+
|
|
72
|
+
| Category | File | Examples |
|
|
73
|
+
|----------|------|----------|
|
|
74
|
+
| Direct instruction override | `direct_instruction_override.rb` | "ignore previous instructions", "new instructions:" |
|
|
75
|
+
| Role manipulation | `role_manipulation.rb` | "DAN mode", "act as unrestricted AI" |
|
|
76
|
+
| Delimiter injection | `delimiter_injection.rb` | `<\|system\|>`, `[INST]`, fake XML/markdown headers |
|
|
77
|
+
| Encoding obfuscation | `encoding_obfuscation.rb` | Base64 payloads, hex escapes, zero-width chars |
|
|
78
|
+
| Indirect injection | `indirect_injection.rb` | "Dear AI", "if you are an LLM" |
|
|
79
|
+
| Context manipulation | `context_manipulation.rb` | `===RESET===`, prompt leaking attempts |
|
|
80
|
+
| Resource extraction | `resource_extraction.rb` | Crypto transfers, wallet drains, credential theft, endpoint exfiltration |
|
|
18
81
|
|
|
19
|
-
|
|
20
|
-
bundle exec rspec spec/promptmenot/detector_spec.rb
|
|
82
|
+
If your pattern doesn't fit any existing category, open an issue to discuss adding a new one.
|
|
21
83
|
|
|
22
|
-
|
|
23
|
-
|
|
84
|
+
### 2. Write the pattern
|
|
85
|
+
|
|
86
|
+
All pattern classes inherit from `Patterns::Base` and use the `register` DSL:
|
|
87
|
+
|
|
88
|
+
```ruby
|
|
89
|
+
# frozen_string_literal: true
|
|
90
|
+
|
|
91
|
+
module Promptmenot
|
|
92
|
+
module Patterns
|
|
93
|
+
class DirectInstructionOverride < Base
|
|
94
|
+
register(
|
|
95
|
+
name: :ignore_previous_instructions,
|
|
96
|
+
regex: /\bignore\s+(all\s+)?(previous|prior|above)\s+(instructions|directives|rules)\b/i,
|
|
97
|
+
sensitivity: :low,
|
|
98
|
+
confidence: :high
|
|
99
|
+
)
|
|
100
|
+
end
|
|
101
|
+
end
|
|
102
|
+
end
|
|
24
103
|
```
|
|
25
104
|
|
|
26
|
-
|
|
105
|
+
**Parameters:**
|
|
106
|
+
|
|
107
|
+
- **`name`** (Symbol) - Unique identifier across all patterns. Use `snake_case` describing the attack.
|
|
108
|
+
- **`regex`** (Regexp) - The detection pattern. Always use `\b` word boundaries to prevent substring matches. Always use the `/i` flag for case-insensitive matching.
|
|
109
|
+
- **`sensitivity`** (Symbol) - When this pattern activates:
|
|
110
|
+
- `:low` - High-signal, almost never a false positive. Use for unambiguous injection phrases like "ignore all previous instructions".
|
|
111
|
+
- `:medium` - Reasonable default. May have edge cases but generally reliable.
|
|
112
|
+
- `:high` - Catches more but may flag legitimate text. Use for broader patterns like "from now on you must...".
|
|
113
|
+
- `:paranoid` - Maximum coverage, higher false positive rate. Use for patterns that catch things like encoded content that *could* be legitimate.
|
|
114
|
+
- **`confidence`** (Symbol) - How certain we are that a match is actually an injection:
|
|
115
|
+
- `:high` - The matched text is almost certainly an injection attempt.
|
|
116
|
+
- `:medium` - Likely an injection, but could appear in normal text.
|
|
117
|
+
- `:low` - Suspicious but may need human review.
|
|
118
|
+
|
|
119
|
+
### 3. Guidelines for good patterns
|
|
120
|
+
|
|
121
|
+
**DO:**
|
|
122
|
+
- Use `\b` word boundaries to anchor matches
|
|
123
|
+
- Use non-capturing groups `(?:...)` when you don't need captures
|
|
124
|
+
- Test against realistic false positives (see [Testing](#testing))
|
|
125
|
+
- Keep regexes readable - complex patterns should have comments
|
|
126
|
+
- Consider multilingual variations if applicable
|
|
127
|
+
|
|
128
|
+
**DON'T:**
|
|
129
|
+
- Write overly broad patterns that match normal English (e.g., don't match just "ignore" alone)
|
|
130
|
+
- Duplicate existing patterns - check the registry first
|
|
131
|
+
- Use lookbehinds/lookaheads unless necessary (they hurt performance)
|
|
132
|
+
- Forget the `/i` flag - injections come in all cases
|
|
133
|
+
|
|
134
|
+
### 4. Sensitivity/confidence decision guide
|
|
135
|
+
|
|
136
|
+
Ask yourself:
|
|
137
|
+
|
|
138
|
+
> "If a user submitted this text in a normal form, would it ever appear naturally?"
|
|
139
|
+
|
|
140
|
+
- **Never** (e.g., "ignore all previous instructions") -> `:low` sensitivity, `:high` confidence
|
|
141
|
+
- **Rarely** (e.g., "new instructions:") -> `:medium` sensitivity, `:medium` confidence
|
|
142
|
+
- **Sometimes** (e.g., "from now on you will") -> `:high` sensitivity, `:medium` confidence
|
|
143
|
+
- **Often** (e.g., encoded Base64 content) -> `:paranoid` sensitivity, `:low` confidence
|
|
144
|
+
|
|
145
|
+
## Testing
|
|
146
|
+
|
|
147
|
+
Every pattern needs tests. No exceptions.
|
|
148
|
+
|
|
149
|
+
### Running tests
|
|
27
150
|
|
|
28
151
|
```bash
|
|
29
|
-
#
|
|
30
|
-
bundle exec
|
|
152
|
+
# Full suite
|
|
153
|
+
bundle exec rspec
|
|
31
154
|
|
|
32
|
-
#
|
|
33
|
-
bundle exec
|
|
155
|
+
# Specific pattern tests
|
|
156
|
+
bundle exec rspec spec/promptmenot/patterns/direct_instruction_override_spec.rb
|
|
157
|
+
|
|
158
|
+
# Run a single test by line number
|
|
159
|
+
bundle exec rspec spec/promptmenot/patterns/direct_instruction_override_spec.rb:19
|
|
34
160
|
```
|
|
35
161
|
|
|
36
|
-
|
|
162
|
+
### Writing pattern tests
|
|
37
163
|
|
|
38
|
-
|
|
39
|
-
2. **Create a branch** for your feature: `git checkout -b feature/my-feature`
|
|
40
|
-
3. **Make your changes** and add tests
|
|
41
|
-
4. **Ensure all tests pass**: `bundle exec rspec`
|
|
42
|
-
5. **Ensure code is clean**: `bundle exec rubocop -a`
|
|
43
|
-
6. **Commit** with clear messages: `git commit -am 'Add new pattern for X'`
|
|
44
|
-
7. **Push** to your fork: `git push origin feature/my-feature`
|
|
45
|
-
8. **Open a PR** on GitHub
|
|
164
|
+
Pattern specs follow a consistent structure with two sections: **detections** (unsafe text that should be caught) and **false positive resistance** (safe text that should pass).
|
|
46
165
|
|
|
47
|
-
|
|
166
|
+
```ruby
|
|
167
|
+
# frozen_string_literal: true
|
|
168
|
+
|
|
169
|
+
RSpec.describe Promptmenot::Patterns::YourCategory do
|
|
170
|
+
let(:patterns) { described_class.patterns }
|
|
171
|
+
|
|
172
|
+
describe "pattern registration" do
|
|
173
|
+
it "registers patterns" do
|
|
174
|
+
expect(patterns).not_to be_empty
|
|
175
|
+
end
|
|
176
|
+
|
|
177
|
+
it "all patterns have correct category" do
|
|
178
|
+
patterns.each do |pattern|
|
|
179
|
+
expect(pattern.category).to eq(:your_category)
|
|
180
|
+
end
|
|
181
|
+
end
|
|
182
|
+
end
|
|
183
|
+
|
|
184
|
+
describe "detections" do
|
|
185
|
+
[
|
|
186
|
+
"your injection example here",
|
|
187
|
+
"another variant of the attack",
|
|
188
|
+
].each do |injection|
|
|
189
|
+
it "detects: #{injection[0..50]}" do
|
|
190
|
+
result = Promptmenot.detect(injection, sensitivity: :high)
|
|
191
|
+
expect(result).to be_unsafe, "Expected '#{injection}' to be detected as unsafe"
|
|
192
|
+
end
|
|
193
|
+
end
|
|
194
|
+
end
|
|
195
|
+
|
|
196
|
+
describe "false positive resistance" do
|
|
197
|
+
[
|
|
198
|
+
"Normal sentence that looks similar but is safe",
|
|
199
|
+
"Another benign example using similar words",
|
|
200
|
+
].each do |safe_text|
|
|
201
|
+
it "allows: #{safe_text[0..50]}" do
|
|
202
|
+
result = Promptmenot.detect(safe_text, sensitivity: :medium)
|
|
203
|
+
expect(result).to be_safe, "Expected '#{safe_text}' to pass but got: #{result.patterns_detected}"
|
|
204
|
+
end
|
|
205
|
+
end
|
|
206
|
+
end
|
|
207
|
+
end
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
**Important testing notes:**
|
|
211
|
+
|
|
212
|
+
- Detection tests use `sensitivity: :high` to ensure patterns are active
|
|
213
|
+
- False positive tests use `sensitivity: :medium` (the default) to ensure safe text isn't wrongly flagged at normal settings
|
|
214
|
+
- The custom matchers `be_safe` and `be_unsafe` are defined in `spec/spec_helper.rb`
|
|
215
|
+
- Include at least 3-5 detection examples covering variations (caps, spacing, phrasing)
|
|
216
|
+
- Include at least 3-5 false positive examples with similar-looking but safe text
|
|
217
|
+
- `Promptmenot.reset!` runs automatically between tests (configured in spec_helper)
|
|
218
|
+
|
|
219
|
+
## Code Style
|
|
220
|
+
|
|
221
|
+
We use RuboCop. Run it before submitting:
|
|
222
|
+
|
|
223
|
+
```bash
|
|
224
|
+
bundle exec rubocop # check
|
|
225
|
+
bundle exec rubocop -a # auto-fix
|
|
226
|
+
```
|
|
227
|
+
|
|
228
|
+
Key conventions:
|
|
229
|
+
|
|
230
|
+
- **Frozen string literals** required in all files (`# frozen_string_literal: true`)
|
|
231
|
+
- **Double quotes** for strings
|
|
232
|
+
- **Max line length**: 120 characters (relaxed for regex patterns in pattern files)
|
|
233
|
+
- **Max method length**: 20 lines
|
|
234
|
+
- **Max class length**: 150 lines
|
|
235
|
+
|
|
236
|
+
## Submitting Changes
|
|
237
|
+
|
|
238
|
+
1. **Fork** the repository
|
|
239
|
+
2. **Create a branch**: `git checkout -b feature/detect-new-attack-type`
|
|
240
|
+
3. **Write your code and tests**
|
|
241
|
+
4. **Run the full suite**: `bundle exec rake`
|
|
242
|
+
5. **Commit** with a clear message:
|
|
243
|
+
```
|
|
244
|
+
feat: add detection for [attack type]
|
|
245
|
+
|
|
246
|
+
Adds N patterns to detect [description].
|
|
247
|
+
Includes M safe-text cases for false positive resistance.
|
|
248
|
+
```
|
|
249
|
+
6. **Push** to your fork: `git push origin feature/detect-new-attack-type`
|
|
250
|
+
7. **Open a Pull Request** against `main`
|
|
251
|
+
|
|
252
|
+
### Commit message format
|
|
253
|
+
|
|
254
|
+
We loosely follow [Conventional Commits](https://www.conventionalcommits.org/):
|
|
255
|
+
|
|
256
|
+
- `feat:` new patterns, features, or capabilities
|
|
257
|
+
- `fix:` bug fixes or false positive corrections
|
|
258
|
+
- `docs:` documentation changes
|
|
259
|
+
- `test:` test-only changes
|
|
260
|
+
- `chore:` build, CI, or maintenance tasks
|
|
48
261
|
|
|
49
|
-
|
|
262
|
+
### PR checklist
|
|
50
263
|
|
|
51
|
-
|
|
52
|
-
- `name` — unique identifier
|
|
53
|
-
- `regex` — detection pattern
|
|
54
|
-
- `sensitivity` — `:low`, `:medium`, `:high`, or `:paranoid`
|
|
55
|
-
- `confidence` — `:high`, `:medium`, or `:low`
|
|
264
|
+
Before submitting, make sure:
|
|
56
265
|
|
|
57
|
-
|
|
266
|
+
- [ ] All tests pass (`bundle exec rspec`)
|
|
267
|
+
- [ ] RuboCop is clean (`bundle exec rubocop`)
|
|
268
|
+
- [ ] New patterns have both detection and false-positive tests
|
|
269
|
+
- [ ] Sensitivity and confidence levels are justified
|
|
270
|
+
- [ ] No overly broad regexes that would cause false positives at `:medium`
|
|
58
271
|
|
|
59
272
|
## Reporting Issues
|
|
60
273
|
|
|
61
|
-
|
|
274
|
+
### Bugs
|
|
275
|
+
|
|
276
|
+
Open an issue with:
|
|
277
|
+
|
|
62
278
|
- Clear description of the problem
|
|
63
|
-
- Steps to reproduce
|
|
279
|
+
- Steps to reproduce
|
|
64
280
|
- Expected vs. actual behavior
|
|
65
|
-
- Ruby
|
|
281
|
+
- Ruby and Rails version (`ruby -v`, `rails -v`)
|
|
282
|
+
- PromptMeNot version (`Promptmenot::VERSION`)
|
|
283
|
+
|
|
284
|
+
### False Positives
|
|
285
|
+
|
|
286
|
+
If PromptMeNot is flagging legitimate text:
|
|
287
|
+
|
|
288
|
+
- Include the exact text being flagged
|
|
289
|
+
- The sensitivity level you're using
|
|
290
|
+
- Which pattern(s) matched (check `result.matches`)
|
|
291
|
+
|
|
292
|
+
### Missed Injections
|
|
293
|
+
|
|
294
|
+
If you've found an injection that gets through:
|
|
295
|
+
|
|
296
|
+
- Include the injection text
|
|
297
|
+
- The sensitivity level you tested at
|
|
298
|
+
- Bonus points if you include a PR with the fix!
|
|
66
299
|
|
|
67
300
|
## License
|
|
68
301
|
|
|
69
|
-
All contributions are made under the MIT
|
|
302
|
+
All contributions are made under the [MIT License](LICENSE.txt).
|
data/README.md
CHANGED
|
@@ -1,9 +1,50 @@
|
|
|
1
1
|
# PromptMeNot
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
[](https://github.com/kevinl05/promptmenot/actions/workflows/ci.yml)
|
|
4
|
+
[](https://rubygems.org/gems/promptmenot)
|
|
5
|
+
[](https://opensource.org/licenses/MIT)
|
|
6
|
+
[](https://www.ruby-lang.org/)
|
|
4
7
|
|
|
5
|
-
-
|
|
6
|
-
|
|
8
|
+
PromptMeNot is a Ruby gem that helps protect your application against prompt injection attacks. It scans user-submitted text for malicious patterns like instruction overrides, role manipulation, delimiter injection, encoding tricks, and more. With ~70 built-in detection patterns organized across 7 attack categories, it covers direct injection (where users try to hijack your LLM through form inputs), indirect injection (where malicious prompts are stored in user content like profiles or comments, waiting for other LLMs to scrape and execute them), and resource extraction (where attackers trick AI agents into transferring funds, leaking credentials, or exhausting compute resources). It plugs into Rails with a simple ActiveModel validator or works standalone in any Ruby app, with configurable sensitivity levels so you can tune the trade-off between coverage and false positives.
|
|
9
|
+
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
### What it catches
|
|
13
|
+
|
|
14
|
+
| Attack type | Description |
|
|
15
|
+
|---|---|
|
|
16
|
+
| **Direct injection** | Users trying to override LLM instructions via form inputs (e.g., "ignore all previous instructions") |
|
|
17
|
+
| **Indirect injection** | Malicious prompts stored in profiles, comments, or posts that target LLMs scraping or processing your site |
|
|
18
|
+
| **Delimiter attacks** | Fake system tokens, ChatML tags, and XML/markdown boundaries injected into text |
|
|
19
|
+
| **Obfuscation** | Base64-encoded payloads, zero-width characters, hex escapes, and other encoding tricks |
|
|
20
|
+
| **Resource extraction** | Crypto transfer requests, wallet drain attacks, credential theft, financial urgency manipulation |
|
|
21
|
+
|
|
22
|
+
### What it doesn't do
|
|
23
|
+
|
|
24
|
+
> PromptMeNot is a **supplemental defense layer**, not a silver bullet.
|
|
25
|
+
|
|
26
|
+
It uses pattern matching to catch known injection techniques, which means:
|
|
27
|
+
|
|
28
|
+
- **It can be bypassed.** Sufficiently creative or novel attacks may evade detection. Prompt injection is an evolving problem and no regex-based approach will catch everything.
|
|
29
|
+
- **It's not a replacement for other safeguards.** You should still use system prompts with clear boundaries, output filtering, least-privilege API access, and human review where appropriate.
|
|
30
|
+
- **It won't prevent all LLM misuse.** It focuses on the input side. It doesn't monitor or constrain what your LLM outputs.
|
|
31
|
+
|
|
32
|
+
Think of it like input validation for SQL injection: you still use parameterized queries, but rejecting `'; DROP TABLE users--` at the front door doesn't hurt. PromptMeNot is that front door check for prompt injection.
|
|
33
|
+
|
|
34
|
+
### Defense in depth
|
|
35
|
+
|
|
36
|
+
For production apps, pair PromptMeNot with other layers:
|
|
37
|
+
|
|
38
|
+
| Layer | What to do |
|
|
39
|
+
|---|---|
|
|
40
|
+
| **Structural isolation** | Wrap user input in XML delimiters (`<user_input>...</user_input>`) so the LLM treats it as data, not instructions |
|
|
41
|
+
| **System prompt design** | Explicitly tell the model to ignore instructions found inside user content |
|
|
42
|
+
| **Output validation** | Check LLM responses for leaked system prompts, PII, or unexpected behavior before returning them to users |
|
|
43
|
+
| **Least-privilege access** | Restrict what your LLM can do (read-only DB access, scoped API keys, no `eval`) |
|
|
44
|
+
|
|
45
|
+
PromptMeNot handles the fast, cheap first pass. It catches the known attacks before they cost you an API call. The layers above handle the rest.
|
|
46
|
+
|
|
47
|
+
---
|
|
7
48
|
|
|
8
49
|
## Installation
|
|
9
50
|
|
|
@@ -26,10 +67,10 @@ rails generate promptmenot:install # creates config/initializers/promptmenot.rb
|
|
|
26
67
|
|
|
27
68
|
```ruby
|
|
28
69
|
class UserProfile < ApplicationRecord
|
|
29
|
-
# Reject mode (default)
|
|
70
|
+
# Reject mode (default) — adds validation error
|
|
30
71
|
validates :bio, prompt_safety: true
|
|
31
72
|
|
|
32
|
-
# Sanitize mode
|
|
73
|
+
# Sanitize mode — strips malicious content, no error
|
|
33
74
|
validates :about_me, prompt_safety: { mode: :sanitize }
|
|
34
75
|
|
|
35
76
|
# Custom sensitivity
|
|
@@ -90,7 +131,7 @@ end
|
|
|
90
131
|
|
|
91
132
|
## Sensitivity Levels
|
|
92
133
|
|
|
93
|
-
Sensitivity controls which patterns are active. Each pattern declares a minimum sensitivity level
|
|
134
|
+
Sensitivity controls which patterns are active. Each pattern declares a minimum sensitivity level and only runs when the requested sensitivity is at or above that level.
|
|
94
135
|
|
|
95
136
|
| Pattern sensitivity | Active at `:low` | `:medium` | `:high` | `:paranoid` |
|
|
96
137
|
|---|---|---|---|---|
|
|
@@ -111,17 +152,155 @@ Sensitivity controls which patterns are active. Each pattern declares a minimum
|
|
|
111
152
|
| `encoding_obfuscation` | Base64 payloads, zero-width chars, hex escapes | ~10 |
|
|
112
153
|
| `indirect_injection` | "Dear AI", "if you are an LLM", "note to chatbot" | ~10 |
|
|
113
154
|
| `context_manipulation` | `===RESET===`, "the above is a test", prompt leaking | ~8 |
|
|
155
|
+
| `resource_extraction` | "transfer 100 SOL to", credential theft, wallet drain, endpoint exfiltration | ~10 |
|
|
114
156
|
|
|
115
157
|
## False Positive Mitigation
|
|
116
158
|
|
|
117
159
|
Patterns use contextual qualifiers to minimize false positives:
|
|
118
160
|
|
|
119
|
-
- "ignore" alone is fine
|
|
120
|
-
- "act as" requires malicious qualifiers
|
|
121
|
-
- "you are now" requires AI/restriction qualifiers
|
|
122
|
-
- "from now on" requires imperative "you must/will"
|
|
161
|
+
- "ignore" alone is fine, but "ignore **previous instructions**" is flagged
|
|
162
|
+
- "act as" requires malicious qualifiers, so "act as a consultant" passes
|
|
163
|
+
- "you are now" requires AI/restriction qualifiers, so "you are now subscribed" passes
|
|
164
|
+
- "from now on" requires imperative "you must/will", so "from now on I'll work from home" passes
|
|
123
165
|
- Broad patterns are placed at `:high`/`:paranoid` sensitivity so they don't fire at default settings
|
|
124
166
|
|
|
167
|
+
## FAQ
|
|
168
|
+
|
|
169
|
+
<details open>
|
|
170
|
+
<summary><b>Does this work without Rails?</b></summary>
|
|
171
|
+
<br>
|
|
172
|
+
|
|
173
|
+
Yes. The ActiveModel validator is the Rails integration, but the core API works in any Ruby app:
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
Promptmenot.safe?("some text")
|
|
177
|
+
Promptmenot.detect("some text")
|
|
178
|
+
Promptmenot.sanitize("some text")
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
The Railtie only loads if `Rails::Railtie` is defined.
|
|
182
|
+
|
|
183
|
+
</details>
|
|
184
|
+
|
|
185
|
+
<details>
|
|
186
|
+
<summary><b>Is this thread-safe?</b></summary>
|
|
187
|
+
<br>
|
|
188
|
+
|
|
189
|
+
Yes. The module singleton (configuration, registry) is protected by a `Monitor`. Pattern matching itself is stateless, so concurrent calls to `detect` or `sanitize` are safe.
|
|
190
|
+
|
|
191
|
+
</details>
|
|
192
|
+
|
|
193
|
+
<details>
|
|
194
|
+
<summary><b>What happens when patterns overlap?</b></summary>
|
|
195
|
+
<br>
|
|
196
|
+
|
|
197
|
+
The detector deduplicates automatically. If two patterns match overlapping regions of text, it keeps the larger match and discards the smaller one. This prevents double-counting in results and avoids garbled output in sanitize mode.
|
|
198
|
+
|
|
199
|
+
</details>
|
|
200
|
+
|
|
201
|
+
<details>
|
|
202
|
+
<summary><b>Can I use this to scan existing database records?</b></summary>
|
|
203
|
+
<br>
|
|
204
|
+
|
|
205
|
+
Yes. You can run detection against any string, not just incoming form input:
|
|
206
|
+
|
|
207
|
+
```ruby
|
|
208
|
+
UserProfile.find_each do |profile|
|
|
209
|
+
result = Promptmenot.detect(profile.bio, sensitivity: :high)
|
|
210
|
+
puts "#{profile.id}: #{result.summary}" if result.unsafe?
|
|
211
|
+
end
|
|
212
|
+
```
|
|
213
|
+
|
|
214
|
+
</details>
|
|
215
|
+
|
|
216
|
+
<details>
|
|
217
|
+
<summary><b>What's the performance like?</b></summary>
|
|
218
|
+
<br>
|
|
219
|
+
|
|
220
|
+
At default sensitivity (`:medium`), roughly 40-50 patterns are active. Each is a single regex scan, so detection is fast on typical user input. The `max_length` config (default: 50,000 characters) truncates excessively long inputs before scanning to prevent regex backtracking on adversarial payloads.
|
|
221
|
+
|
|
222
|
+
</details>
|
|
223
|
+
|
|
224
|
+
<details>
|
|
225
|
+
<summary><b>Can I scan only specific categories?</b></summary>
|
|
226
|
+
<br>
|
|
227
|
+
|
|
228
|
+
Yes. Both the `Detector` and `Sanitizer` accept a `categories` filter:
|
|
229
|
+
|
|
230
|
+
```ruby
|
|
231
|
+
detector = Promptmenot::Detector.new(
|
|
232
|
+
sensitivity: :high,
|
|
233
|
+
categories: [:delimiter_injection, :encoding_obfuscation]
|
|
234
|
+
)
|
|
235
|
+
result = detector.detect(user_input)
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
This is useful if you only care about certain attack types for a given field.
|
|
239
|
+
|
|
240
|
+
</details>
|
|
241
|
+
|
|
242
|
+
<details>
|
|
243
|
+
<summary><b>How does the on_detect callback work?</b></summary>
|
|
244
|
+
<br>
|
|
245
|
+
|
|
246
|
+
The callback fires whenever an injection is detected, before the result is returned. It receives the full `Result` object, so you can log, alert, or track metrics:
|
|
247
|
+
|
|
248
|
+
```ruby
|
|
249
|
+
Promptmenot.configure do |config|
|
|
250
|
+
config.on_detect = ->(result) {
|
|
251
|
+
Rails.logger.warn("Injection detected: #{result.summary}")
|
|
252
|
+
StatsD.increment("promptmenot.injection_detected")
|
|
253
|
+
}
|
|
254
|
+
end
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
If the callback raises an exception, it's rescued and printed to `warn` so it never breaks your app.
|
|
258
|
+
|
|
259
|
+
</details>
|
|
260
|
+
|
|
261
|
+
<details>
|
|
262
|
+
<summary><b>Does it catch leetspeak and Cyrillic homoglyphs?</b></summary>
|
|
263
|
+
<br>
|
|
264
|
+
|
|
265
|
+
Yes. The `encoding_obfuscation` category includes patterns for leetspeak injection (e.g., `1gn0r3 1nstruct10ns`) and mixed-script homoglyphs (Latin + Cyrillic characters in the same string). The homoglyph pattern is set to `:paranoid` sensitivity since mixed scripts can appear in legitimate multilingual content.
|
|
266
|
+
|
|
267
|
+
</details>
|
|
268
|
+
|
|
269
|
+
<details>
|
|
270
|
+
<summary><b>What's the difference between confidence and sensitivity?</b></summary>
|
|
271
|
+
<br>
|
|
272
|
+
|
|
273
|
+
They answer different questions:
|
|
274
|
+
|
|
275
|
+
- **Sensitivity** controls *when* a pattern runs. A `:low` sensitivity pattern runs at all levels. A `:paranoid` pattern only runs when you explicitly crank sensitivity up.
|
|
276
|
+
- **Confidence** describes *how certain* we are that a match is actually an attack. A `:high` confidence match (e.g., "ignore all previous instructions") is almost certainly malicious. A `:low` confidence match (e.g., mixed Cyrillic/Latin text) might be legitimate.
|
|
277
|
+
|
|
278
|
+
You can filter results by confidence after detection using `result.high_confidence_matches`.
|
|
279
|
+
|
|
280
|
+
</details>
|
|
281
|
+
|
|
282
|
+
<details>
|
|
283
|
+
<summary><b>Can I add patterns without modifying the gem source?</b></summary>
|
|
284
|
+
<br>
|
|
285
|
+
|
|
286
|
+
Yes. Use the config DSL in your initializer:
|
|
287
|
+
|
|
288
|
+
```ruby
|
|
289
|
+
Promptmenot.configure do |config|
|
|
290
|
+
config.add_pattern(
|
|
291
|
+
name: :my_app_specific_attack,
|
|
292
|
+
regex: /some pattern specific to your app/i,
|
|
293
|
+
category: :custom,
|
|
294
|
+
sensitivity: :medium,
|
|
295
|
+
confidence: :high
|
|
296
|
+
)
|
|
297
|
+
end
|
|
298
|
+
```
|
|
299
|
+
|
|
300
|
+
Custom patterns go through the same detection pipeline as built-in ones.
|
|
301
|
+
|
|
302
|
+
</details>
|
|
303
|
+
|
|
125
304
|
## Contributing
|
|
126
305
|
|
|
127
306
|
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on:
|
|
@@ -0,0 +1,77 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Promptmenot
|
|
4
|
+
module Patterns
|
|
5
|
+
class ResourceExtraction < Base
|
|
6
|
+
register(
|
|
7
|
+
name: :crypto_transfer_request,
|
|
8
|
+
regex: /\b(?:transfer|send|move|withdraw)\s+\d+(?:\.\d+)?\s*(?:SOL|ETH|BTC|USDC|USDT|XRP|MATIC|AVAX|DOT|ADA|BNB|DOGE|tokens?|coins?)\s+(?:to|into|towards)\b/i,
|
|
9
|
+
sensitivity: :low,
|
|
10
|
+
confidence: :high
|
|
11
|
+
)
|
|
12
|
+
|
|
13
|
+
register(
|
|
14
|
+
name: :wallet_address_with_instruction,
|
|
15
|
+
regex: /\b(?:transfer|send|move|withdraw|deposit)\b.{0,80}(?:0x[0-9a-fA-F]{20,}|[13][a-km-zA-HJ-NP-Z1-9]{25,34}|[1-9A-HJ-NP-Za-km-z]{32,44})\b/i,
|
|
16
|
+
sensitivity: :low,
|
|
17
|
+
confidence: :high
|
|
18
|
+
)
|
|
19
|
+
|
|
20
|
+
register(
|
|
21
|
+
name: :full_balance_drain,
|
|
22
|
+
regex: /\b(?:send|transfer|move|withdraw|drain|sweep)\s+(?:all|entire|whole|every|full|remaining|total)\s+(?:of\s+)?(?:your\s+|my\s+|the\s+)?(?:balance|funds?|tokens?|holdings?|assets?|coins?|crypto|portfolio|wallet)\b/i,
|
|
23
|
+
sensitivity: :low,
|
|
24
|
+
confidence: :high
|
|
25
|
+
)
|
|
26
|
+
|
|
27
|
+
register(
|
|
28
|
+
name: :financial_urgency_manipulation,
|
|
29
|
+
regex: /\b(?:urgent(?:ly)?|immediate(?:ly)?|right\s+now|asap|time[- ]sensitive|quickly|hurry|before\s+it'?s?\s+too\s+late|window\s+closing)\b.{0,60}\b(?:transfer|send|pay|wire|withdraw|transaction|funds?|money|payment)\b/i,
|
|
30
|
+
sensitivity: :medium,
|
|
31
|
+
confidence: :medium
|
|
32
|
+
)
|
|
33
|
+
|
|
34
|
+
register(
|
|
35
|
+
name: :authorization_claim_for_transfer,
|
|
36
|
+
regex: /\b(?:i(?:'m|\s+am)\s+authorized|i\s+have\s+(?:authorization|permission|approval)|(?:authorized|approved|cleared)\s+(?:to|for)\s+(?:receive|collect|access|withdraw))\b.{0,40}\b(?:funds?|tokens?|payment|money|transfer|wallet|assets?)\b/i,
|
|
37
|
+
sensitivity: :low,
|
|
38
|
+
confidence: :high
|
|
39
|
+
)
|
|
40
|
+
|
|
41
|
+
register(
|
|
42
|
+
name: :execute_transaction_instruction,
|
|
43
|
+
regex: /\b(?:execute|submit|place|perform|make|initiate|process)\s+(?:a\s+)?(?:swap|trade|sell\s+order|buy\s+order|transaction|transfer|market\s+order|limit\s+order|exchange)\b/i,
|
|
44
|
+
sensitivity: :medium,
|
|
45
|
+
confidence: :medium
|
|
46
|
+
)
|
|
47
|
+
|
|
48
|
+
register(
|
|
49
|
+
name: :credential_extraction,
|
|
50
|
+
regex: /\b(?:give|show|tell|send|share|reveal|display|output|print|provide|paste)\s+(?:me\s+)?(?:your\s+|the\s+)?(?:API\s+key|secret\s+key|private\s+key|seed\s+phrase|mnemonic|recovery\s+phrase|password|credentials?|auth\s+token|access\s+token|wallet\s+key)\b/i,
|
|
51
|
+
sensitivity: :low,
|
|
52
|
+
confidence: :high
|
|
53
|
+
)
|
|
54
|
+
|
|
55
|
+
register(
|
|
56
|
+
name: :external_endpoint_exfiltration,
|
|
57
|
+
regex: /\b(?:send|post|submit|forward|transmit|upload|push|pipe)\s+(?:the\s+)?(?:results?|output|response|keys?|tokens?|credentials?|secrets?|data)\s+(?:to|at|via)\s+(?:https?:\/\/|webhook|endpoint|external)/i,
|
|
58
|
+
sensitivity: :low,
|
|
59
|
+
confidence: :high
|
|
60
|
+
)
|
|
61
|
+
|
|
62
|
+
register(
|
|
63
|
+
name: :grant_access_escalation,
|
|
64
|
+
regex: /\b(?:grant|give|provide|allow|enable)\s+(?:me\s+)?(?:full|admin|root|owner|unlimited|unrestricted|complete)?\s*(?:access|control|permissions?|privileges?)\s+(?:to|over|for)\s+(?:the\s+)?(?:wallet|account|funds?|system|database|keys?|credentials?|resources?)\b/i,
|
|
65
|
+
sensitivity: :medium,
|
|
66
|
+
confidence: :high
|
|
67
|
+
)
|
|
68
|
+
|
|
69
|
+
register(
|
|
70
|
+
name: :resource_exhaustion,
|
|
71
|
+
regex: /\b(?:use|exhaust|consume|burn\s+through|deplete|drain|spend|max\s+out)\s+(?:all\s+)?(?:(?:your|the|my|available|remaining)\s+)?(?:credits?|quota|budget|compute|resources?|API\s+(?:calls?|requests?)|rate\s+limit|tokens?|capacity)\b/i,
|
|
72
|
+
sensitivity: :medium,
|
|
73
|
+
confidence: :high
|
|
74
|
+
)
|
|
75
|
+
end
|
|
76
|
+
end
|
|
77
|
+
end
|
data/lib/promptmenot/version.rb
CHANGED
data/lib/promptmenot.rb
CHANGED
|
@@ -17,6 +17,7 @@ require_relative "promptmenot/patterns/delimiter_injection"
|
|
|
17
17
|
require_relative "promptmenot/patterns/encoding_obfuscation"
|
|
18
18
|
require_relative "promptmenot/patterns/indirect_injection"
|
|
19
19
|
require_relative "promptmenot/patterns/context_manipulation"
|
|
20
|
+
require_relative "promptmenot/patterns/resource_extraction"
|
|
20
21
|
require_relative "promptmenot/detector"
|
|
21
22
|
require_relative "promptmenot/sanitizer"
|
|
22
23
|
require_relative "promptmenot/validator"
|
|
@@ -81,7 +82,8 @@ module Promptmenot
|
|
|
81
82
|
Patterns::DelimiterInjection,
|
|
82
83
|
Patterns::EncodingObfuscation,
|
|
83
84
|
Patterns::IndirectInjection,
|
|
84
|
-
Patterns::ContextManipulation
|
|
85
|
+
Patterns::ContextManipulation,
|
|
86
|
+
Patterns::ResourceExtraction
|
|
85
87
|
]
|
|
86
88
|
end
|
|
87
89
|
|
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: promptmenot
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.2.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- promptmenot contributors
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: bin
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-02-
|
|
11
|
+
date: 2026-02-25 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: activemodel
|
|
@@ -71,6 +71,7 @@ files:
|
|
|
71
71
|
- lib/promptmenot/patterns/direct_instruction_override.rb
|
|
72
72
|
- lib/promptmenot/patterns/encoding_obfuscation.rb
|
|
73
73
|
- lib/promptmenot/patterns/indirect_injection.rb
|
|
74
|
+
- lib/promptmenot/patterns/resource_extraction.rb
|
|
74
75
|
- lib/promptmenot/patterns/role_manipulation.rb
|
|
75
76
|
- lib/promptmenot/railtie.rb
|
|
76
77
|
- lib/promptmenot/result.rb
|