@riavzon/bot-detector 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,512 @@
1
+ # bot-detector
2
+
3
+ ![Coverage](./badges/coverage.svg)
4
+
5
+ `@riavzon/bot-detector` is an Express middleware that checks incoming requests through a two-phase pipeline of 17 checkers covering IP reputation, geolocation, TLS fingerprinting, behavioral rate limiting, Tor analysis, and more.
6
+
7
+ Each checker contributes a penalty score toward a configurable ban threshold. Requests that cross the threshold receive a `403` response, or are banned at the firewall level if configured.
8
+
9
+ `@riavzon/bot-detector` uses [Shield-Base](https://github.com/Sergo706/shield-base-cli) to fetch and compile its data sources into fast in-memory databases. Checkers query these compiled databases synchronously, which allows the whole pipeline to make a decision in milliseconds.
10
+
11
+
12
+ ## Features
13
+ - Comes with 17 fully configurable server-side checkers.
14
+ - Extensible: provide your own custom checkers via `CheckerRegistry` and custom data sources.
15
+ - Self-optimizing: uses collected visitor data to become smarter and faster over time. Instead of running the full pipeline for known offenders, it compiles your latest database rows into local MMDB files to instantly drop past threats and high-risk visitors.
16
+ - Fast: around 1.2 ms median latency for the full pipeline.
17
+ - Supports multiple storage and database backends: SQLite, PostgreSQL, MySQL, Redis, LRU, and in-memory.
18
+ - Comes with a CLI to manage data sources and generate custom threat databases.
19
+ - Ships CommonJS builds and is fully typed.
20
+
21
+ ## Requirements
22
+ - Node.js 18 or later
23
+ - Express 5
24
+ - A supported database for visitor persistence
25
+
26
+ ## Installation
27
+
28
+ ```bash
29
+ npm install @riavzon/bot-detector
30
+ ```
31
+
32
+ After installation, the package runs `bot-detector init` automatically to download its data sources and verify that [mmdbctl](https://github.com/ipinfo/mmdbctl) is installed; if it is not, it prompts you and installs it automatically. It also asks you for a user agent string used to fetch [BGP](https://en.wikipedia.org/wiki/Border_Gateway_Protocol) data from bgp.tools, which requires one before you may use their data; more info at [BGP.tools](https://bgp.tools/kb/api).
33
+
34
+ The compiled databases are written to `_data-sources/` inside the package directory and include the following files:
35
+
36
+ ```fs
37
+ ├── asn.mmdb
38
+ ├── banned.mmdb // generated on demand from your visitors history data
39
+ ├── city.mmdb
40
+ ├── country.mmdb
41
+ ├── firehol_anonymous.mmdb
42
+ ├── firehol_l1.mmdb
43
+ ├── firehol_l2.mmdb
44
+ ├── firehol_l3.mmdb
45
+ ├── firehol_l4.mmdb
46
+ ├── goodBots.mmdb
47
+ ├── highRisk.mmdb // generated on demand from your visitors history data
48
+ ├── ja4-db
49
+ │ ├── ja4.mdb
50
+ │ └── ja4.mdb-lock
51
+ ├── proxy.mmdb
52
+ ├── suffix.json
53
+ ├── tor.mmdb
54
+ └── useragent-db
55
+ ├── useragent.mdb
56
+ └── useragent.mdb-lock
57
+
58
+ ```
59
+ More information about each database and its source can be found in [Shield-Base](https://github.com/Sergo706/shield-base-cli) readme.
60
+
61
+ These databases are read-only; you can interact with them via `getDataSources`, for example:
62
+
63
+ ```ts
64
+ import { getDataSources } from '@riavzon/bot-detector';
65
+
66
+ const ds = getDataSources();
67
+
68
+ // IP lookups: all return null if the IP is not in the database
69
+ ds.asnDataBase(ip); // BGP/ASN record: asn_id, asn_name, classification, hits
70
+ ds.cityDataBase(ip); // city-level geo: city, region, country, lat/lon, timezone
71
+ ds.countryDataBase(ip); // country-level geo: country, countryCode, isp, org, proxy, hosting
72
+ ds.torDataBase(ip); // Tor relay record: flags, exit_addresses, version, probabilities
73
+ ds.proxyDataBase(ip); // proxy record: type, sources that flagged this ip
74
+ ds.goodBotsDataBase(ip); // known good crawler record (Googlebot, Bingbot, etc.)
75
+ ds.fireholAnonDataBase(ip); // Firehol anonymous feed match
76
+ ds.fireholLvl1DataBase(ip); // Firehol threat level 1 (most severe)
77
+ ds.fireholLvl2DataBase(ip); // Firehol threat level 2
78
+ ds.fireholLvl3DataBase(ip); // Firehol threat level 3
79
+ ds.fireholLvl4DataBase(ip); // Firehol threat level 4
80
+ ds.bannedDataBase(ip); // your banned.mmdb, generated by `bot-detector generate`
81
+ ds.highRiskDataBase(ip); // your highRisk.mmdb, generated by `bot-detector generate`
82
+
83
+ // LMDB key value stores
84
+ ds.getUserAgentLmdb().get(uaString); // user agent pattern record
85
+ ds.getJa4Lmdb().get(ja4Hash); // JA4 TLS fingerprint record
86
+ ```
87
+
88
+ ## Quick start
89
+
90
+ `defineConfiguration` is async and must resolve before you attach `detectBots` to your routes. Call it exactly once at startup before `app.listen`.
91
+
92
+ ```ts
93
+ import express from 'express';
94
+ import cookieParser from 'cookie-parser';
95
+ import { defineConfiguration, detectBots } from '@riavzon/bot-detector';
96
+
97
+ const app = express();
98
+ app.use(cookieParser());
99
+
100
+ await defineConfiguration({
101
+ store: {
102
+ main: { driver: 'mysql-pool', host: 'localhost', user: 'root', database: 'mydb' },
103
+ },
104
+ });
105
+
106
+ app.use(detectBots());
107
+
108
+ app.get('/', (req, res) => {
109
+ res.json({ banned: req.botDetection?.banned });
110
+ });
111
+ ```
112
+
113
+ ## Configuration
114
+
115
+ `defineConfiguration` accepts a configuration object. Every field has a default value, only `store.main` is required. The full schema with all defaults is defined in [`src/botDetector/types/configSchema.ts`](src/botDetector/types/configSchema.ts).
116
+
117
+ ```ts
118
+ await defineConfiguration({
119
+ // Required: database connection for visitor persistence
120
+ store: {
121
+ main: { driver: 'mysql-pool', host: 'localhost', user: 'root', database: 'mydb' },
122
+ },
123
+
124
+ // Score required to ban a visitor (0–100). Default: 100
125
+ banScore: 100,
126
+
127
+ // Maximum score assignable per request (0–100). Default: 100
128
+ maxScore: 100,
129
+
130
+ // Points the reputation healer restores per clean request. Default: 10
131
+ restoredReputationPoints: 10,
132
+
133
+ // Score persistence strategy. See "Score modes" below. Default: false
134
+ setNewComputedScore: false,
135
+
136
+ // IPs that bypass all detection. Accepts IPv4, IPv6, or CIDR strings.
137
+ whiteList: ['127.0.0.1', '::1'],
138
+
139
+ // Recheck interval for returning visitors. Default: check every request
140
+ checksTimeRateControl: {
141
+ checkEveryRequest: false,
142
+ checkEvery: 1000 * 60 * 5, // ms
143
+ },
144
+
145
+ // Async write queue that persists visitor scores without blocking requests
146
+ batchQueue: {
147
+ flushIntervalMs: 5000,
148
+ maxBufferSize: 100,
149
+ maxRetries: 3,
150
+ },
151
+
152
+ // Cache driver for visitor state, behavioral data, and sessions.
153
+ // Defaults to memory when omitted.
154
+ storage: { driver: 'redis', host: 'localhost', port: 6379 },
155
+
156
+ // Whether to issue a UFW firewall ban in addition to a 403 response. Default: false
157
+ punishmentType: {
158
+ enableFireWallBan: false,
159
+ },
160
+
161
+ // Pino log level. Default: 'info'
162
+ logLevel: 'info',
163
+
164
+ // Individual checker configuration. All checkers are enabled by default.
165
+ // See "Checker reference" for available penalty options per checker.
166
+ checkers: {
167
+ enableBehaviorRateCheck: {
168
+ enable: true,
169
+ behavioral_window: 60_000, // window duration in ms
170
+ behavioral_threshold: 30, // max requests per window before penalty applies
171
+ penalties: 60,
172
+ },
173
+ honeypot: {
174
+ enable: true,
175
+ paths: ['/admin', '/.env', '/wp-login.php'],
176
+ },
177
+ enableGeoChecks: {
178
+ enable: true,
179
+ bannedCountries: ['KP', 'IR'], // ISO 3166-1 alpha-2 codes
180
+ },
181
+ // ...other checkers
182
+ },
183
+
184
+ // Controls custom MMDB generation from your visitor data. See `bot-detector generate`.
185
+ generator: {
186
+ scoreThreshold: 70, // minimum suspicious_activity_score to include in highRisk.mmdb
187
+ generateTypes: false, // generate typescript types
188
+ deleteAfterBuild: false, // delete source rows after compiling
189
+ mmdbctlPath: 'mmdbctl', // path to mmdbctl binary
190
+ },
191
+ });
192
+
193
+ ```
194
+ After configuring the database, run `bot-detector load-schema` to load the database schema before first use.
195
+
196
+ ### Score modes
197
+
198
+ `setNewComputedScore` controls how the bot score is written to the database on each request.
199
+
200
+ **`false` (default): snapshot, then heal.** The detector writes the computed score once on the visitor's first request. The reputation healer then decrements it on each subsequent clean visit. The score only decreases until the cache expires and a new snapshot is taken.
201
+
202
+ **`true`: live snapshot.** The detector overwrites the stored score on every request, then the healer immediately decrements it. Use this when you want the database to always reflect the latest computed risk.
203
+
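The healing arithmetic in snapshot mode can be sketched as a pure function. This is illustrative only; the real healer lives inside the package and uses your configured `restoredReputationPoints` (default 10):

```ts
// Illustrative sketch of snapshot-then-heal arithmetic, not the package's code.
// Each clean visit restores `restore` points; the score never drops below 0.
function healedScores(snapshot: number, cleanVisits: number, restore = 10): number[] {
  const scores: number[] = [];
  let score = snapshot;
  for (let i = 0; i < cleanVisits; i++) {
    score = Math.max(0, score - restore);
    scores.push(score);
  }
  return scores;
}

// A visitor snapshotted at 40 heals across four clean requests:
// healedScores(40, 4) → [30, 20, 10, 0]
```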
204
+ ## Database drivers
205
+
206
+ The `store.main` field accepts the following drivers:
207
+
208
+ | Driver | Value | Notes |
209
+ |---|---|---|
210
+ | MySQL (pool) | `mysql-pool` | Peer dependency: `mysql2 >=3` |
211
+ | PostgreSQL | `postgresql` | Requires `pg` |
212
+ | SQLite | `sqlite` | Requires `better-sqlite3` |
213
+ | Cloudflare D1 | `cloudflare-d1` | Pass `binding` from the Worker environment |
214
+ | PlanetScale | `planetscale` | Pass `host`, `username`, `password` |
215
+
216
+ ```ts
217
+ // MySQL pool
218
+ { driver: 'mysql-pool', host: 'localhost', user: 'root', password: 'secret', database: 'mydb' }
219
+
220
+ // PostgreSQL
221
+ { driver: 'postgresql', connectionString: 'postgres://user:pass@localhost/mydb' }
222
+
223
+ // SQLite
224
+ { driver: 'sqlite', name: './bot-detector.db' }
225
+ ```
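The remaining drivers from the table follow the same shape. The field names below mirror the table's notes; the concrete values are illustrative assumptions:

```ts
// Cloudflare D1: pass the binding from the Worker environment (illustrative)
{ driver: 'cloudflare-d1', binding: env.DB }

// PlanetScale: pass host, username, password (illustrative placeholders)
{ driver: 'planetscale', host: 'example.psdb.cloud', username: 'user', password: 'secret' }
```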
226
+
227
+ ## Cache drivers
228
+
229
+ The `storage` field configures where visitor state, behavioral rate data, and session records are stored between requests. When omitted, the package uses memory.
230
+
231
+ | Driver | Value | Notes |
232
+ |---|---|---|
233
+ | memory (default) | *(omit `storage`)* | Single-process only |
234
+ | LRU cache | `lru` | In-process LRU; configure `max` and `ttl` |
235
+ | Redis | `redis` | Shared across instances; requires `ioredis` |
236
+ | Upstash Redis | `upstash` | Serverless Redis via HTTP |
237
+ | Filesystem | `fs` | Persistent local storage for development |
238
+ | Cloudflare KV (binding) | `cloudflare-kv-binding` | Pass `binding` |
239
+ | Cloudflare KV (HTTP) | `cloudflare-kv-http` | Pass `accountId`, `namespaceId`, `apiToken` |
240
+ | Cloudflare R2 | `cloudflare-r2-binding` | Pass `binding` |
241
+ | Vercel | `vercel` | Vercel Runtime Cache |
242
+
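As a sketch, the `storage` configs for the in-process and shared drivers might look like the following. The `lru` fields follow the table's `max`/`ttl` note; the Redis fields match the configuration example above:

```ts
// In-process LRU cache (single instance)
storage: { driver: 'lru', max: 10_000, ttl: 1000 * 60 * 10 }

// Shared Redis instance (requires ioredis)
storage: { driver: 'redis', host: 'localhost', port: 6379 }
```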
243
+ ## Checker reference
244
+
245
+ All 17 checkers are enabled with sensible defaults. To disable a checker, pass `{ enable: false }` for its config key. To adjust penalties, pass `{ enable: true, penalties: { ... } }` with the values you want to override.
246
+
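For example, a minimal sketch of both patterns, using checker config keys from the table below (the `penalties: 60` shape follows the configuration example above):

```ts
await defineConfiguration({
  store: { main: { driver: 'sqlite', name: './bot-detector.db' } },
  checkers: {
    enableTorAnalysis: { enable: false },                      // turn a checker off
    enableGeoChecks: { enable: true, bannedCountries: ['KP'] }, // override options
    enableBehaviorRateCheck: { enable: true, penalties: 60 },   // override a penalty
  },
});
```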
247
+ | Checker | Config key | Phase | What it detects |
248
+ |---|---|---|---|
249
+ | ip validation | `enableIpChecks` | cheap | Invalid or unresolvable client ip |
250
+ | Known good bots | `enableGoodBotsChecks` | cheap | Legitimate crawlers (Googlebot, Bingbot, etc.)|
251
+ | Browser and device | `enableBrowserAndDeviceChecks` | cheap | CLI/library user agent types, Internet Explorer, impossible browser and OS combinations |
252
+ | Locale consistency | `localeMapsCheck` | cheap | Mismatch between `Accept-Language` header and geo locale |
253
+ | FireHOL threat feeds | `enableKnownThreatsDetections` | cheap | IPs in FireHOL levels 1–4 and the anonymizer feed |
254
+ | ASN classification | `enableAsnClassification` | cheap | Hosting and content ASNs with low route visibility |
255
+ | Tor node analysis | `enableTorAnalysis` | cheap | Exit nodes, guard nodes, bad exits, and obsolete Tor versions |
256
+ | Timezone consistency | `enableTimezoneConsistency` | cheap | Mismatch between declared timezone and geo timezone |
257
+ | Honeypot paths | `honeypot` | cheap | Requests to configured trap URLs |
258
+ | Known bad IPs | `enableKnownBadIpsCheck` | cheap | IPs in your custom `highRisk.mmdb` |
259
+ | Behavioral rate | `enableBehaviorRateCheck` | heavy | Request count exceeding the configured threshold within the window |
260
+ | Proxy / ISP / cookie | `enableProxyIspCookiesChecks` | heavy | Proxy and VPN detection, missing canary cookie, unknown ISP or org |
261
+ | user agent and headers | `enableUaAndHeaderChecks` | heavy | Headless browsers, short user agents, TLS fingerprint mismatch, header anomalies |
262
+ | Geo location | `enableGeoChecks` | heavy | Missing geo fields, banned countries |
263
+ | Session coherence | `enableSessionCoherence` | heavy | Referer mismatches and cross-site navigation inconsistencies |
264
+ | Velocity fingerprint | `enableVelocityFingerprint` | heavy | Unnaturally consistent inter-request timing |
265
+ | Bad user agent list | `knownBadUserAgents` | heavy | user agents matching the LMDB pattern library (critical → low severity) |
266
+
267
+ ### Detection phases
268
+
269
+ The pipeline runs in two phases to keep latency low.
270
+
271
+ **Cheap phase** runs on every request. All lookups are synchronous, in-memory reads from MMDB or LMDB databases. When the accumulated score reaches `banScore` during this phase, the middleware rejects the request and skips the heavy phase entirely.
272
+
273
+ **Heavy phase** runs only when the cheap-phase score stays below `banScore`. These checkers read from the visitor cache or perform async operations.
274
+
275
+ ## Request object
276
+
277
+ On every request that passes detection, the middleware populates `req.botDetection`:
278
+
279
+ ```ts
280
+ req.botDetection: {
281
+ success: boolean,
282
+ banned: boolean,
283
+ time: string, // ISO timestamp
284
+ ipAddress: string
285
+ }
286
+ ```
287
+
288
+ ## Custom checkers
289
+
290
+ You can add your own checkers to the pipeline without modifying any package files. Each checker is a class that implements `IBotChecker` and registers itself via `CheckerRegistry.register()`. The middleware picks it up automatically.
291
+
292
+ See [CUSTOM.md](CUSTOM.md) for the full guide, which covers:
293
+
294
+ - The `IBotChecker` interface and phase selection
295
+ - All fields available on `ValidationContext` (geo, Tor, ASN, parsed user agent, proxy, threat level, cookies, and more)
296
+ - Typed custom context via `buildCustomContext` for full IntelliSense on `ctx.custom`
297
+ - Triggering an immediate ban via the `BAD_BOT_DETECTED` reason code
298
+ - Writing async checkers with your own cache
299
+
300
+ For example, you might use a client-side detection tool that collects data, which can then be sent to your custom checker for analysis:
301
+
302
+ ```ts
303
+ // types/clientSignals.ts
304
+ export interface ClientSignals {
305
+ hasWebDriver: boolean;
306
+ screenResolution: string | null;
307
+ touchPoints: number;
308
+ }
309
+
310
+ // server.ts
311
+ import { detectBots } from '@riavzon/bot-detector';
312
+ import type { ClientSignals } from './types/clientSignals.js';
313
+
314
+ app.use(
315
+ detectBots<ClientSignals>((req) => {
316
+ try {
317
+ return JSON.parse(req.headers['x-client-signals'] as string);
318
+ } catch {
319
+ return { hasWebDriver: false, screenResolution: null, touchPoints: 0 };
320
+ }
321
+ // or read ctx.req in your checker directly
322
+ })
323
+ );
324
+
325
+ // checkers/clientSideChecker.ts
326
+ import { CheckerRegistry, getDataSources, getStorage } from '@riavzon/bot-detector';
327
+ import type { IBotChecker, ValidationContext, BotDetectorConfig, BanReasonCode } from '@riavzon/bot-detector';
328
+ import type { ClientSignals } from '../types/clientSignals.js';
329
+
330
+ class ClientSideChecker implements IBotChecker<BanReasonCode, ClientSignals> {
331
+ name = 'client-side-signals';
332
+ phase = 'cheap' as const;
333
+
334
+ isEnabled(_config: BotDetectorConfig) {
335
+ return true;
336
+ }
337
+
338
+ async run(ctx: ValidationContext<ClientSignals>, _config: BotDetectorConfig) {
339
+ const reasons: BanReasonCode[] = [];
340
+ let score = 0;
341
+
342
+ if (ctx.custom.hasWebDriver) {
343
+ reasons.push('BAD_BOT_DETECTED'); // immediate ban, no score needed
344
+ return { score, reasons };
345
+ }
346
+
347
+ const cached = await getStorage().getItem<number>(`client-signals:${ctx.ipAddress}`);
348
+ if (cached !== null) {
349
+ return { score: cached, reasons: cached > 0 ? (['BAD_BOT_DETECTED'] as BanReasonCode[]) : [] };
350
+ }
351
+
352
+ const ja4Hash = ctx.req.headers['x-ja4'] as string | undefined;
353
+ if (ja4Hash) {
354
+ const ja4Record = getDataSources().getJa4Lmdb().get(ja4Hash);
355
+ if (ja4Record?.is_bot) {
356
+ score += 60;
357
+ reasons.push('BAD_BOT_DETECTED');
358
+ }
359
+ }
360
+
361
+ if (ctx.custom.screenResolution === null) score += 20;
362
+ if (ctx.tor.exit_addresses) score += 30;
363
+
364
+ // Combine signals: Tor relay running plus no touch input
365
+ if (ctx.tor.running && !ctx.custom.touchPoints) {
366
+ score += 40;
367
+ }
368
+
369
+ await getStorage().setItem(`client-signals:${ctx.ipAddress}`, score, { ttl: 60 * 5 });
370
+ return { score, reasons };
371
+ }
372
+ }
373
+
374
+ CheckerRegistry.register(new ClientSideChecker());
375
+ ```
376
+
377
+ ## CLI
378
+
379
+ The package ships a CLI with three subcommands.
380
+
381
+ ### `init`
382
+
383
+ Runs the installation wizard. Verifies that [mmdbctl](https://github.com/ipinfo/mmdbctl) is installed (and installs it if not), prompts for a BGP.tools contact string, then compiles all data sources in parallel:
384
+
385
+ - BGP and ASN data
386
+ - City and geography databases
387
+ - Tor node lists
388
+ - Proxy and anonymizer lists
389
+ - Threat levels 1-4 and the anonymous feed
390
+ - Verified crawler IP ranges (Googlebot, Bingbot, Apple, Meta, etc.)
391
+ - User agent pattern database (`useragent.mdb`)
392
+ - JA4 fingerprint database (`ja4.mdb`)
393
+
394
+ The compiled databases are written to `_data-sources/` inside the package directory.
395
+
396
+ In non-interactive environments, `init` skips silently if the databases already exist. If they do not exist, it prints a warning and exits without failing.
397
+
398
+ ```bash
399
+ npx bot-detector init
400
+ ```
401
+
402
+ ### `refresh`
403
+
404
+ Redownloads and recompiles all data sources that the module uses, using the cached configuration. Requires `init` to have been run at least once.
405
+
406
+ ```bash
407
+ npx bot-detector refresh
408
+ ```
409
+ Run this at least once every 24 hours.
410
+ More info in the [Shield-Base readme](https://github.com/Sergo706/shield-base-cli).
411
+
412
+ ### `generate`
413
+
414
+ Reads your database and compiles two custom mmdb files:
415
+
416
+ - `banned.mmdb`: built from all rows in the `banned` table with a non null ip address
417
+ - `highRisk.mmdb`: built from `visitors` rows where `suspicious_activity_score >= generator.scoreThreshold`
418
+
419
+ Requires mmdbctl. If the path in `generator.mmdbctlPath` cannot be resolved, the command prompts to install it and exits with instructions.
420
+
421
+ ```bash
422
+ npx bot-detector generate
423
+ ```
424
+
425
+ Run this periodically or after bulk ban operations.
426
+
427
+ ## API
428
+
429
+ ### `defineConfiguration(config)`
430
+
431
+ Initializes the middleware. Opens all mmdb and lmdb databases, starts the batch write queue, and sets up the cache and database connection. Call it once before attaching `detectBots` to your app.
432
+
433
+ ### `detectBots(buildCustomContext?)`
434
+
435
+ Returns an Express `RequestHandler`. Always call it as a factory: use `detectBots()`, not `detectBots`. The optional `buildCustomContext` function runs once per request before any checker executes and populates `ctx.custom` with typed data you define.
436
+
437
+ ```ts
438
+ app.use(
439
+ detectBots<MyContext>((req) => ({
440
+ userId: req.user?.id ?? 'anonymous',
441
+ plan: req.user?.plan ?? 'free',
442
+ }))
443
+ );
444
+ ```
445
+
446
+ ### `ApiResponse`
447
+
448
+ An Express `Router` that mounts `detectBots()` at `/check` and returns `{ results: req.botDetection, message: 'Fingerprint logged successfully' }`.
449
+
450
+ ```ts
451
+ import { ApiResponse } from '@riavzon/bot-detector';
452
+ app.use('/bot', ApiResponse); // POST /bot/check
453
+ ```
454
+
455
+ ### `getDataSources()`
456
+
457
+ Returns the initialized `DataSources` instance. Throws if called before `defineConfiguration` resolves.
458
+
459
+ ### `getStorage()`
460
+
461
+ Returns the initialized `Storage` instance. Throws if called before `defineConfiguration` resolves.
462
+
463
+ ### `getBatchQueue()`
464
+
465
+ Returns the initialized `BatchQueue` instance used for deferred database writes. Throws if called before `defineConfiguration` resolves.
466
+
467
+ ### `runGeneration()`
468
+
469
+ Programmatic equivalent of `bot-detector generate`. Compiles `banned.mmdb` and `highRisk.mmdb` from your database. If `generator.deleteAfterBuild` is `true`, source rows are deleted after each successful compile.
470
+
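As a sketch, `runGeneration` can be wired into your own scheduling. The daily interval below is an arbitrary choice, not a package recommendation:

```ts
import { runGeneration } from '@riavzon/bot-detector';

// Rebuild banned.mmdb and highRisk.mmdb once a day after startup
setInterval(() => {
  runGeneration().catch((err) => console.error('mmdb generation failed', err));
}, 24 * 60 * 60 * 1000);
```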
471
+ ### `banIp(ip, info)`
472
+
473
+ Issues a UFW firewall rule (`sudo ufw insert 1 deny from <ip>`) to block the IP at the OS level. Only runs when `punishmentType.enableFireWallBan` is `true`; returns immediately otherwise. Requires the Node.js process to have passwordless `sudo` access to `ufw`.
474
+
475
+ ### `parseUA(uaString)`
476
+
477
+ Parses a user agent string and returns a `ParsedUAResult` with browser name, version, OS, device type, vendor, and model.
478
+
479
+ ### `getGeoData(ip)`
480
+
481
+ Returns the full `GeoResponse` for any IP address using the MMDB databases. Useful for geo lookups outside the middleware context.
482
+
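A short sketch combining the two helpers. The exact field names on `ParsedUAResult` and `GeoResponse` are assumptions for illustration:

```ts
import { parseUA, getGeoData } from '@riavzon/bot-detector';

// Illustrative: `browser` and `country` field names are assumptions
const ua = parseUA('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
const geo = getGeoData('203.0.113.7');
console.log(ua.browser, geo?.country);
```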
483
+ ### `updateIsBot(isBot, cookie)`
484
+
485
+ Updates the `is_bot` column in the `visitors` table for the given `canary_id`.
486
+
487
+ ### `updateBannedIP(cookie, ipAddress, country, userAgent, info)`
488
+
489
+ Upserts a row into the `banned` table with the visitor's canary cookie, ip address, country, user agent, ban reasons, and score.
490
+
491
+ ### `warmUp()`
492
+
493
+ Warms the database connection pool by running parallel `SELECT 1` queries, then fires a dummy visitor query to prime the query plan cache. Call this after `defineConfiguration` resolves but before the server starts accepting traffic.
494
+
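A startup-order sketch (the port and store config are placeholders):

```ts
import express from 'express';
import { defineConfiguration, detectBots, warmUp } from '@riavzon/bot-detector';

const app = express();

await defineConfiguration({
  store: { main: { driver: 'sqlite', name: './bot-detector.db' } },
});
await warmUp(); // prime the pool before taking traffic

app.use(detectBots());
app.listen(3000);
```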
495
+ ### `updateVisitors(data, cookie, visitorId)`
496
+
497
+ Updates the full fingerprint record in the `visitors` table for a given canary and visitor id pair. Returns `{ success: boolean, reason?: string }`.
498
+
499
+ ### `CheckerRegistry`
500
+
501
+ Registry for custom bot checker plugins. Use `CheckerRegistry.register(checker)` to add a checker that implements `IBotChecker`. Checkers are partitioned into `cheap` and `heavy` phases and filtered by your config at runtime.
502
+
503
+ ### `BadBotDetected` / `GoodBotDetected`
504
+
505
+ Error subclasses thrown (or catchable) when a checker conclusively identifies a bad or good bot. Re-exported from `helpers/exceptions` for use in custom checkers and error-handling middleware.
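A sketch of catching them in error-handling middleware, assuming the classes propagate to Express's error chain; the response body is illustrative:

```ts
import type { ErrorRequestHandler } from 'express';
import { BadBotDetected } from '@riavzon/bot-detector';

// Illustrative: turn a conclusive bad-bot detection into a custom 403 body
const botErrorHandler: ErrorRequestHandler = (err, _req, res, next) => {
  if (err instanceof BadBotDetected) {
    res.status(403).json({ error: 'access denied' });
    return;
  }
  next(err);
};

// Register after your routes: app.use(botErrorHandler);
```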
506
+
507
+ ---
508
+ A dedicated documentation site is coming soon.
509
+
510
+ ## License
511
+
512
+ Apache-2.0
@@ -0,0 +1,98 @@
1
+ {
2
+ "yandex": {
3
+ "suffix": ["yandex.ru", "yandex.net", "yandex.com"],
4
+ "description": "Yandex is a Russian multinational corporation specializing in Internet-related products and services, including a search engine, online advertising, and more.",
5
+ "docs": "https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.html?lang=en",
6
+ "useragent": "YandexBot"
7
+ },
8
+ "mj12bot": {
9
+ "suffix": ["mj12bot.com", "majestic.com"],
10
+ "description": "MJ12bot is a web crawler used for SEO and data analysis.",
11
+ "docs": "https://mj12bot.com/",
12
+ "useragent": "MJ12bot"
13
+ },
14
+ "discord": {
15
+ "suffix": "discord.com",
16
+ "description": "Discord is a VoIP, instant messaging and digital distribution platform.",
17
+ "docs": "https://discord.com/",
18
+ "useragent": "Discordbot"
19
+ },
20
+ "baidu": {
21
+ "suffix": ["baidu.com", "baidu.jp"],
22
+ "description": "Baidu is a Chinese multinational technology company specializing in Internet-related services and products.",
23
+ "docs": "https://www.baidu.com/",
24
+ "useragent": "Baiduspider"
25
+ },
26
+ "yahoo": {
27
+ "suffix": ["yahoo.com", "crawl.yahoo.net"],
28
+ "description": "Yahoo is a web services provider and a subsidiary of Verizon Communications.",
29
+ "docs": "https://www.yahoo.com/",
30
+ "useragent": "Yahoo! Slurp"
31
+ },
32
+ "google": {
33
+ "suffix": ["googlebot.com", "googleusercontent.com", "google.com", "googlezip.net"],
34
+ "description": "Allow google search engine and other related services of google to index your site.",
35
+ "docs": "https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot",
36
+ "useragent": "Googlebot"
37
+ },
38
+ "bing": {
39
+ "suffix": "msn.com",
40
+ "description": "Bing is a web search engine owned and operated by Microsoft.",
41
+ "docs": "https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26",
42
+ "useragent": "Bingbot"
43
+ },
44
+ "apple": {
45
+ "suffix": "apple.com",
46
+ "description": "Allow Apple's web crawler to access your site, which is used for indexing and other services like Siri.",
47
+ "docs": "https://support.apple.com/en-us/119829",
48
+ "useragent": "Applebot"
49
+ },
50
+ "openAI": {
51
+ "suffix": "openai.com",
52
+ "description": "OpenAI is an artificial intelligence research organization.",
53
+ "docs": "https://platform.openai.com/docs/bots/",
54
+ "useragent": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]
55
+ },
56
+ "ahrefs": {
57
+ "suffix": "ahrefs.com",
58
+ "description": "Ahrefs is a toolset for SEO and marketing.",
59
+ "docs": "https://help.ahrefs.com/en/articles/78658-what-is-the-list-of-your-ip-ranges",
60
+ "useragent": "AhrefsBot"
61
+ },
62
+ "CCrawler": {
63
+ "suffix": "commoncrawl.org",
64
+ "description": "CCrawler is a web crawler for the Common Crawl project.",
65
+ "docs": "https://commoncrawl.org/",
66
+ "useragent": "CCBot"
67
+ },
68
+ "xAndTwitterIPList": {
69
+ "suffix": ["x.com", "twttr.net", "twitter.com", "twttr.com"],
70
+ "description": "Let x/twitter to access your site and display rich content on posts sharing your site.",
71
+ "docs": "https://developer.x.com/en/docs/x-for-websites/cards/guides/troubleshooting-cards",
72
+ "useragent": "Twitterbot"
73
+ },
74
+ "facebook": {
75
+ "suffix": ["fbsv.net", "facebook.com", "tfbnw.net"],
76
+ "description": "Let meta platforms to access your site and display rich content on posts sharing your site.",
77
+ "docs": "https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/",
78
+ "useragent": ["FacebookBot", "facebookexternalhit", "facebookcatalog"]
79
+ },
80
+ "pintrest": {
81
+ "suffix": ["pinterest.com", "pinterestcrawler.com"],
82
+ "description": "Let Pinterest to access your site and display rich content on posts sharing your site.",
83
+ "docs": "https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/",
84
+ "useragent": "Pinterestbot"
85
+ },
86
+ "semrush": {
87
+ "suffix": "semrush.com",
88
+ "description": "Semrush is a tool for SEO, PPC, content, social media, and competitive research.",
89
+ "docs": "https://www.semrush.com/",
90
+ "useragent": "SemrushBot"
91
+ },
92
+ "telegram": {
93
+ "suffix": ["tdesktop.com", "t.me", "telegram.org"],
94
+ "description": "Telegram is a cloud-based instant messaging, VoIP, and VoIP service.",
95
+ "docs": "https://telegram.org/",
96
+ "useragent": "TelegramBot"
97
+ }
98
+ }
@@ -0,0 +1,34 @@
1
+ //#region \0rolldown/runtime.js
2
+ var __create = Object.create;
3
+ var __defProp = Object.defineProperty;
4
+ var __getOwnPropDesc = Object.getOwnPropertyDescriptor;
5
+ var __getOwnPropNames = Object.getOwnPropertyNames;
6
+ var __getProtoOf = Object.getPrototypeOf;
7
+ var __hasOwnProp = Object.prototype.hasOwnProperty;
8
+ var __copyProps = (to, from, except, desc) => {
9
+ if (from && typeof from === "object" || typeof from === "function") {
10
+ for (var keys = __getOwnPropNames(from), i = 0, n = keys.length, key; i < n; i++) {
11
+ key = keys[i];
12
+ if (!__hasOwnProp.call(to, key) && key !== except) {
13
+ __defProp(to, key, {
14
+ get: ((k) => from[k]).bind(null, key),
15
+ enumerable: !(desc = __getOwnPropDesc(from, key)) || desc.enumerable
16
+ });
17
+ }
18
+ }
19
+ }
20
+ return to;
21
+ };
22
+ var __toESM = (mod, isNodeMode, target) => (target = mod != null ? __create(__getProtoOf(mod)) : {}, __copyProps(isNodeMode || !mod || !mod.__esModule ? __defProp(target, "default", {
23
+ value: mod,
24
+ enumerable: true
25
+ }) : target, mod));
26
+
27
+ //#endregion
28
+
29
+ Object.defineProperty(exports, '__toESM', {
30
+ enumerable: true,
31
+ get: function () {
32
+ return __toESM;
33
+ }
34
+ });