pompelmi 1.5.0 → 1.6.0

@@ -0,0 +1,199 @@
1
+ # Concurrent Scanning
2
+
3
+ Scanning multiple files in parallel improves throughput but introduces tradeoffs around resource usage, partial failures, and connection limits. This page covers the main patterns.
4
+
5
+ ---
6
+
7
+ ## `Promise.all` — scan multiple files in parallel
8
+
9
+ `Promise.all` runs all scans concurrently and resolves when every scan completes. If any scan rejects (throws), the entire `Promise.all` rejects immediately. The remaining scans are not cancelled: they keep running, and their results are discarded.
10
+
11
+ ```js
12
+ const { scan, Verdict } = require('pompelmi');
13
+
14
+ const files = [
15
+ '/uploads/document.pdf',
16
+ '/uploads/photo.jpg',
17
+ '/uploads/archive.zip',
18
+ ];
19
+
20
+ const results = await Promise.all(files.map(f => scan(f)));
21
+
22
+ results.forEach((result, i) => {
23
+ if (result === Verdict.Malicious) {
24
+ console.log(`${files[i]} is malicious.`);
25
+ }
26
+ });
27
+ ```
28
+
29
+ Use `Promise.all` when:
30
+ - All files must be accepted for the request to succeed.
31
+ - You want to fail fast if any scan throws.
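
The fail-fast behaviour can be observed without ClamAV. The sketch below is illustrative only: `fakeScan` and `demo` are hypothetical stand-ins (not part of pompelmi) that resolve or reject after a delay, which is enough to show that `Promise.all` rejects on the first failure while the other tasks still run to completion.

```javascript
// Records the order in which the fake scans finish.
const finished = [];

// Hypothetical stand-in for scan(): settles after `ms` milliseconds.
function fakeScan(name, ms, shouldFail) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      finished.push(name);
      if (shouldFail) reject(new Error(`${name} failed`));
      else resolve('clean');
    }, ms);
  });
}

async function demo() {
  let error;
  try {
    await Promise.all([
      fakeScan('a.pdf', 50, false),
      fakeScan('b.zip', 10, true), // rejects first
      fakeScan('c.png', 30, false),
    ]);
  } catch (err) {
    error = err;
  }
  // Snapshot taken as soon as Promise.all has rejected.
  const finishedAtRejection = [...finished];
  // Give the remaining (uncancelled) tasks time to finish.
  await new Promise((resolve) => setTimeout(resolve, 100));
  return {
    message: error.message,
    finishedAtRejection,
    finishedEventually: [...finished],
  };
}
```

If the still-running scans have side effects (temporary files, open sockets), the caller needs to account for them even though `Promise.all` has already rejected.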
32
+
33
+ ---
34
+
35
+ ## `Promise.allSettled` — partial failures
36
+
37
+ `Promise.allSettled` waits for all scans to complete regardless of individual failures. Each result has a `status` of `'fulfilled'` or `'rejected'`.
38
+
39
+ ```js
40
+ const { scan, Verdict } = require('pompelmi');
41
+
42
+ const files = ['/uploads/a.pdf', '/uploads/b.zip', '/uploads/c.png'];
43
+
44
+ const settled = await Promise.allSettled(
45
+ files.map(async (f) => ({ path: f, verdict: await scan(f) }))
46
+ );
47
+
48
+ const accepted = [];
49
+ const rejected = [];
50
+
51
+ for (const [i, r] of settled.entries()) {
52
+ if (r.status === 'rejected') {
53
+ // allSettled preserves input order, so files[i] is the path that failed
+ rejected.push({ path: files[i], reason: r.reason.message });
54
+ continue;
55
+ }
56
+ const { path, verdict } = r.value;
57
+ if (verdict === Verdict.Clean) {
58
+ accepted.push(path);
59
+ } else {
60
+ rejected.push({ path, reason: verdict.description });
61
+ }
62
+ }
63
+
64
+ console.log({ accepted, rejected });
65
+ ```
66
+
67
+ Use `Promise.allSettled` when:
68
+ - You want to process as many files as possible even if some fail.
69
+ - You need to report which specific files were rejected.
70
+
71
+ ---
72
+
73
+ ## `scanDirectory()` — scan an entire folder
74
+
75
+ `scanDirectory()` handles concurrent scanning of every file in a directory internally. It catches per-file errors and collects them into the `errors` array rather than throwing.
76
+
77
+ ```js
78
+ const fs = require('fs');
79
+ const { scanDirectory } = require('pompelmi');
80
+
81
+ const results = await scanDirectory('/uploads', {
82
+ host: process.env.CLAMAV_HOST,
83
+ port: 3310,
84
+ });
85
+
86
+ console.log(`Clean: ${results.clean.length}`);
87
+ console.log(`Malicious: ${results.malicious.length}`);
88
+ console.log(`Errors: ${results.errors.length}`);
89
+
90
+ // Auto-delete malicious files
91
+ results.malicious.forEach(f => fs.unlinkSync(f));
92
+ ```
93
+
94
+ Use `scanDirectory()` when:
95
+ - You have an existing folder of files to audit.
96
+ - You want a single-call interface with clean/malicious/errors output.
97
+
98
+ ---
99
+
100
+ ## Rate limiting concurrent scans with `p-limit`
101
+
102
+ Unbounded `Promise.all` with a large number of files can overwhelm clamd or exhaust the OS file descriptor limit. Use `p-limit` to cap concurrency.
103
+
104
+ ```bash
105
+ npm install p-limit@3   # v4 and later are ESM-only; v3 is the last release that works with require()
106
+ ```
107
+
108
+ ```js
109
+ const pLimit = require('p-limit');
110
+ const { scan, Verdict } = require('pompelmi');
111
+
112
+ const files = getFilePaths(); // array of N paths
113
+ const limit = pLimit(5); // at most 5 concurrent scans
114
+
115
+ const results = await Promise.all(
116
+ files.map(f => limit(() => scan(f, { host: 'clamav', port: 3310 })))
117
+ );
118
+ ```
119
+
120
+ Recommended concurrency limits:
121
+
122
+ | Mode | Suggested concurrency |
123
+ |------|----------------------|
124
+ | Local (`clamscan`) | 2–4 (CPU-bound) |
125
+ | TCP (single clamd) | 5–10 |
126
+ | TCP (multiple clamd replicas) | 20–50 |
127
+
128
+ Tune based on your hardware and observed clamd CPU usage.
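
If adding a dependency is not an option, the same cap can be approximated with a small worker-pool helper. This is a minimal sketch, not pompelmi's or `p-limit`'s implementation; `mapWithConcurrency` is a hypothetical name, and like `Promise.all` it rejects on the first failure while already-started work continues.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner pulls the next unclaimed index until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const runners = Array.from(
    { length: Math.min(limit, items.length) },
    () => runner()
  );
  await Promise.all(runners);
  return results;
}

// Usage with pompelmi (assumes `scan` and `files` from the example above):
// const verdicts = await mapWithConcurrency(files, 5, (f) => scan(f));
```

Because the index is claimed synchronously before each `await`, two runners never process the same item.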
129
+
130
+ ---
131
+
132
+ ## Concurrently scanning buffers
133
+
134
+ ```js
135
+ const { scanBuffer, Verdict } = require('pompelmi');
136
+
137
+ // req.files from multer.array()
138
+ const results = await Promise.allSettled(
139
+ req.files.map(file =>
140
+ scanBuffer(file.buffer, { host: 'clamav', port: 3310 })
141
+ .then(verdict => ({ name: file.originalname, verdict }))
142
+ )
143
+ );
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Performance considerations
149
+
150
+ ### Local mode
151
+
152
+ Each `scan()` in local mode spawns a `clamscan` child process. Spawning processes is expensive — ClamAV loads its virus database into memory on each invocation. For high-throughput local scanning, consider switching to TCP mode where a persistent `clamd` daemon keeps the database in memory.
153
+
154
+ ### TCP mode
155
+
156
+ In TCP mode, pompelmi opens a new TCP connection per scan call. For sustained high-throughput workloads, the connection overhead is measurable. Options:
157
+
158
+ 1. **Increase concurrency gradually** — start at 5, measure clamd CPU, increase until you see degradation.
159
+ 2. **Scale clamd horizontally** — run multiple clamd containers behind a load balancer.
160
+ 3. **Connection pooling** — pompelmi does not pool connections. For extremely high throughput, implement a connection pool that keeps sockets open and reuses them.
161
+
162
+ ### Memory
163
+
164
+ `scanBuffer()` holds the full file content in memory. For large files (>50 MB), prefer `scan()` (from disk) or `scanStream()` (streaming, no full buffering in TCP mode).
165
+
166
+ ---
167
+
168
+ ## Example: batch-scan upload queue
169
+
170
+ ```js
171
+ const pLimit = require('p-limit');
172
+ const { scan, Verdict } = require('pompelmi');
173
+ const fs = require('fs');
174
+
175
+ async function processBatch(filePaths) {
176
+ const limit = pLimit(8);
177
+
178
+ const results = await Promise.allSettled(
179
+ filePaths.map(filePath =>
180
+ limit(async () => {
181
+ const verdict = await scan(filePath, { host: 'clamav', port: 3310 });
182
+ return { filePath, verdict };
183
+ })
184
+ )
185
+ );
186
+
187
+ for (const r of results) {
188
+ if (r.status === 'rejected') {
189
+ console.error('Scan error:', r.reason.message);
190
+ continue;
191
+ }
192
+ const { filePath, verdict } = r.value;
193
+ if (verdict !== Verdict.Clean) {
194
+ fs.unlinkSync(filePath);
195
+ console.warn('Rejected:', filePath, verdict.description);
196
+ }
197
+ }
198
+ }
199
+ ```
@@ -0,0 +1,190 @@
1
+ # Docker Compose — Production Setup
2
+
3
+ Production-grade docker-compose configuration for running pompelmi with a ClamAV sidecar. This setup includes health checks, restart policy, persistent virus definition storage, and environment variable configuration.
4
+
5
+ ---
6
+
7
+ ## Complete `docker-compose.yml`
8
+
9
+ ```yaml
10
+ services:
11
+ app:
12
+ build: .
13
+ ports:
14
+ - "3000:3000"
15
+ environment:
16
+ NODE_ENV: production
17
+ CLAMAV_HOST: clamav
18
+ CLAMAV_PORT: "3310"
19
+ CLAMAV_TIMEOUT: "30000"
20
+ depends_on:
21
+ clamav:
22
+ condition: service_healthy
23
+ restart: unless-stopped
24
+ volumes:
25
+ - uploads:/app/uploads
26
+
27
+ clamav:
28
+ image: clamav/clamav:stable
29
+ ports:
30
+ - "3310:3310"
31
+ restart: unless-stopped
32
+ volumes:
33
+ - clamav_db:/var/lib/clamav # persist virus definitions across restarts
34
+ healthcheck:
35
+ test: ["CMD", "clamdcheck"]
36
+ interval: 30s
37
+ timeout: 10s
38
+ retries: 5
39
+ start_period: 120s # first start downloads ~300 MB of definitions
40
+
41
+ volumes:
42
+ clamav_db:
43
+ uploads:
44
+ ```
45
+
46
+ ---
47
+
48
+ ## Application code
49
+
50
+ Read options from environment variables so the same image works in all environments:
51
+
52
+ ```js
53
+ const { scan, scanBuffer, scanStream, Verdict } = require('pompelmi');
54
+
55
+ const SCAN_OPTS = {
56
+ host: process.env.CLAMAV_HOST || '127.0.0.1',
57
+ port: Number(process.env.CLAMAV_PORT) || 3310,
58
+ timeout: Number(process.env.CLAMAV_TIMEOUT) || 15_000,
59
+ };
60
+
61
+ const result = await scan('/uploads/file.pdf', SCAN_OPTS);
62
+ ```
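
One caveat with the `Number(...) || default` pattern is that a mistyped value (for example `331O`) silently falls back to the default. A stricter variant might validate the variables at startup instead. The sketch below is an assumption about how that could look; `readScanOptions` is a hypothetical helper, and only the `host`, `port`, and `timeout` option names come from the example above.

```javascript
// Builds scan options from environment variables and fails loudly on bad input.
function readScanOptions(env = process.env) {
  const port = Number(env.CLAMAV_PORT ?? 3310);
  const timeout = Number(env.CLAMAV_TIMEOUT ?? 15000);

  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid CLAMAV_PORT: ${env.CLAMAV_PORT}`);
  }
  if (!Number.isFinite(timeout) || timeout <= 0) {
    throw new Error(`Invalid CLAMAV_TIMEOUT: ${env.CLAMAV_TIMEOUT}`);
  }

  return {
    host: env.CLAMAV_HOST || '127.0.0.1',
    port,
    timeout,
  };
}

// Usage: const SCAN_OPTS = readScanOptions();
```

Failing at startup makes a misconfigured container visible immediately rather than at the first scan.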
63
+
64
+ ---
65
+
66
+ ## Dockerfile example
67
+
68
+ ```dockerfile
69
+ FROM node:22-alpine
70
+
71
+ WORKDIR /app
72
+
73
+ COPY package*.json ./
74
+ RUN npm ci --omit=dev
75
+
76
+ COPY . .
77
+
78
+ RUN mkdir -p uploads
79
+
80
+ EXPOSE 3000
81
+ CMD ["node", "src/server.js"]
82
+ ```
83
+
84
+ Note: `clamscan` does **not** need to be installed in the application container when using TCP mode. The ClamAV sidecar handles all scanning.
85
+
86
+ ---
87
+
88
+ ## Health check explanation
89
+
90
+ `clamdcheck` is a shell script bundled inside the `clamav/clamav:stable` image. It sends a `PING` to the local clamd socket and checks the response. The `start_period: 120s` gives clamd time to download virus definitions on first start before health checks begin counting failures.
91
+
92
+ If you need to check from outside the container, you can also use TCP:
93
+
94
+ ```bash
95
+ echo -n "PING" | nc -q1 localhost 3310
96
+ # expected response: PONG
97
+ ```
98
+
99
+ ---
100
+
101
+ ## `depends_on` with health check
102
+
103
+ ```yaml
104
+ depends_on:
105
+ clamav:
106
+ condition: service_healthy
107
+ ```
108
+
109
+ This prevents the application container from starting until clamd passes its health check. Without this, your app may start and immediately fail its first scan with "connection refused."
110
+
111
+ ---
112
+
113
+ ## Scaling considerations
114
+
115
+ ### Vertical scaling
116
+
117
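+ Each scan runs on a single thread, so giving one clamd instance more CPU does not speed up an individual scan. A single clamd can still serve several scans at once, up to its `MaxThreads` setting. Beyond that, run multiple clamd containers behind a load balancer, as described under horizontal scaling.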
118
+
119
+ ### Horizontal scaling
120
+
121
+ ```yaml
122
+ services:
123
+ clamav:
124
+ image: clamav/clamav:stable
125
+ deploy:
126
+ replicas: 3
127
+ volumes:
128
+ - clamav_db:/var/lib/clamav
129
+ ```
130
+
131
+ Point your application at a load balancer in front of the clamd replicas. Note: each clamd replica downloads its own virus database on startup unless you share the volume (which requires care with concurrent freshclam writes).
132
+
133
+ ### Alternative: one clamd per app instance
134
+
135
+ For simpler setups, co-deploy one clamd container with each app container. Each pair shares a `clamav_db` volume scoped to the pair.
136
+
137
+ ---
138
+
139
+ ## Resource limits
140
+
141
+ ClamAV can use significant memory when unpacking large archives:
142
+
143
+ ```yaml
144
+ clamav:
145
+ image: clamav/clamav:stable
146
+ deploy:
147
+ resources:
148
+ limits:
149
+ memory: 1g
150
+ cpus: '1.0'
151
+ ```
152
+
153
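+ Tune based on your expected file sizes. clamd also keeps the full signature database in memory, which on its own can take more than 1 GB, so treat `1g` as a starting point and check actual usage before enforcing it. Scanning uncompressed archives larger than 100 MB may require more.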
154
+
155
+ ---
156
+
157
+ ## Keeping virus definitions fresh
158
+
159
+ The `clamav/clamav:stable` image runs `freshclam` on a schedule automatically. To check for updates yourself and see what was downloaded, run it with `--verbose`:
160
+
161
+ ```bash
162
+ docker compose exec clamav freshclam --verbose
163
+ ```
164
+
165
+ Or run the same update without the extra output:
166
+
167
+ ```bash
168
+ docker compose exec clamav freshclam
169
+ ```
170
+
171
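+ To pick up new definitions without restarting the app container, restart only the clamav container (freshclam updates on startup). This is not fully zero-downtime: scans sent while clamd is restarting fail with connection errors, so the app should retry or queue them.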
172
+
173
+ ```bash
174
+ docker compose restart clamav
175
+ ```
176
+
177
+ The named volume preserves the downloaded definitions across restarts, so only incremental updates are downloaded after the first start.
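
Scans issued while clamd is restarting may fail until it accepts connections again. One way to ride out that window is a retry wrapper with exponential backoff. The sketch below is a generic helper under stated assumptions: `withRetry` is a hypothetical name, it retries on any error, and a real deployment would probably want to retry only connection-level failures (the exact error shape pompelmi produces is not shown here).

```javascript
// Retries an async task with exponential backoff: delayMs, 2*delayMs, 4*delayMs, ...
async function withRetry(task, { retries = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        const wait = delayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, wait));
      }
    }
  }
  // All attempts failed: surface the most recent error.
  throw lastError;
}

// Usage (assumes `scan` and SCAN_OPTS from the application code above):
// const verdict = await withRetry(() => scan('/uploads/file.pdf', SCAN_OPTS));
```

With the defaults, the task is attempted up to four times over roughly 3.5 seconds of waiting; tune both values to how long clamd takes to reload in your environment.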
178
+
179
+ ---
180
+
181
+ ## Production checklist
182
+
183
+ - [ ] `restart: unless-stopped` on both services
184
+ - [ ] `healthcheck` configured on clamav with `start_period` ≥ 90s
185
+ - [ ] `depends_on: condition: service_healthy` on app
186
+ - [ ] `CLAMAV_HOST`, `CLAMAV_PORT`, `CLAMAV_TIMEOUT` via environment variables
187
+ - [ ] Named volume for `clamav_db` (not anonymous)
188
+ - [ ] File size limits in your HTTP server (before the scan is reached)
189
+ - [ ] Upload directory in a named volume (survives container restarts)
190
+ - [ ] Log aggregation: capture `app` and `clamav` container logs
@@ -0,0 +1,178 @@
1
+ # Docker Setup
2
+
3
+ Run ClamAV as a Docker sidecar so your application host requires no local ClamAV installation. pompelmi's TCP mode streams files directly to the clamd daemon — the API is identical to local mode.
4
+
5
+ ---
6
+
7
+ ## Why a Docker sidecar?
8
+
9
+ - **No local install** — the application container stays lean; ClamAV and its virus definitions live in a dedicated sidecar.
10
+ - **Always up-to-date definitions** — the official `clamav/clamav:stable` image runs `freshclam` on startup and periodically refreshes the database.
11
+ - **Isolation** — ClamAV runs in its own process/container; a crash or restart does not affect your application.
12
+ - **Consistent environments** — same image in development, staging, and production.
13
+
14
+ ---
15
+
16
+ ## docker-compose.yml
17
+
18
+ ```yaml
19
+ services:
20
+ app:
21
+ build: .
22
+ ports:
23
+ - "3000:3000"
24
+ environment:
25
+ CLAMAV_HOST: clamav
26
+ CLAMAV_PORT: 3310
27
+ depends_on:
28
+ clamav:
29
+ condition: service_healthy
30
+
31
+ clamav:
32
+ image: clamav/clamav:stable
33
+ ports:
34
+ - "3310:3310"
35
+ restart: unless-stopped
36
+ volumes:
37
+ - clamav_db:/var/lib/clamav # persist virus definitions across restarts
38
+ healthcheck:
39
+ test: ["CMD", "clamdcheck"] # bundled check script in clamav/clamav image
40
+ interval: 30s
41
+ timeout: 10s
42
+ retries: 5
43
+ start_period: 120s # freshclam download takes time on first boot
44
+
45
+ volumes:
46
+ clamav_db:
47
+ ```
48
+
49
+ > **First boot:** The image downloads the full virus database (~300 MB) before clamd starts accepting connections. `start_period: 120s` gives it time. On subsequent restarts the definitions are read from the volume, so no full download is needed; clamd still has to load the database into memory before it accepts connections.
50
+
51
+ ---
52
+
53
+ ## Pointing pompelmi at clamd
54
+
55
+ Pass `host` and `port` to any pompelmi function. No other code changes are needed.
56
+
57
+ ```js
58
+ const { scan, scanBuffer, scanStream, scanDirectory, Verdict } = require('pompelmi');
59
+
60
+ const CLAMAV_OPTS = {
61
+ host: process.env.CLAMAV_HOST || '127.0.0.1',
62
+ port: Number(process.env.CLAMAV_PORT) || 3310,
63
+ timeout: 30_000, // ms — increase for large files
64
+ };
65
+
66
+ // scan a file by path
67
+ const fileVerdict = await scan('/uploads/report.pdf', CLAMAV_OPTS);
68
+
69
+ // scan an in-memory Buffer (multer memoryStorage)
70
+ const bufferVerdict = await scanBuffer(req.file.buffer, CLAMAV_OPTS);
71
+
72
+ // scan a Readable stream (S3, HTTP, pipes); `s3` is an AWS SDK v2 client created elsewhere
73
+ const stream = s3.getObject({ Bucket, Key }).createReadStream();
74
+ const streamVerdict = await scanStream(stream, CLAMAV_OPTS);
75
+
76
+ // recursively scan a directory
77
+ const results = await scanDirectory('/uploads', CLAMAV_OPTS);
78
+ ```
79
+
80
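+ `scan`, `scanBuffer`, and `scanStream` resolve to one of the `Verdict.Clean`, `Verdict.Malicious`, or `Verdict.ScanError` Symbols; `scanDirectory` resolves to an object with `clean`, `malicious`, and `errors` arrays. No code changes are required when switching between local and TCP mode.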
81
+
82
+ ---
83
+
84
+ ## Configuring timeout for large files
85
+
86
+ The `timeout` option sets the socket idle timeout in milliseconds (default: 15 000 ms). Increase it when scanning large archives or slow network links.
87
+
88
+ ```js
89
+ const result = await scan('/uploads/large-archive.zip', {
90
+ host: 'clamav',
91
+ port: 3310,
92
+ timeout: 120_000, // 2 minutes
93
+ });
94
+ ```
95
+
96
+ If clamd takes longer than `timeout` ms without sending data, pompelmi rejects with:
97
+
98
+ ```
99
+ clamd connection timed out after 120000ms
100
+ ```
101
+
102
+ ---
103
+
104
+ ## Production tips
105
+
106
+ ### Health checks
107
+
108
+ The `healthcheck` in the example above uses the `clamdcheck` script bundled in the official image. Your application container uses `depends_on: condition: service_healthy` so it only starts once clamd is ready.
109
+
110
+ ### Restart policy
111
+
112
+ ```yaml
113
+ restart: unless-stopped
114
+ ```
115
+
116
+ This ensures clamd comes back up after host reboots or OOM kills without manual intervention.
117
+
118
+ ### Persisting virus definitions
119
+
120
+ The named volume `clamav_db` mounts to `/var/lib/clamav` inside the container. This means:
121
+
122
+ - First start downloads definitions once (~300 MB).
123
+ - Subsequent restarts reuse the cache; `freshclam` only downloads incremental updates.
124
+ - The volume survives `docker compose down` (use `docker compose down -v` to wipe it).
125
+
126
+ ### Resource limits (optional)
127
+
128
+ ClamAV can be memory-hungry when scanning large ZIP archives. Set a limit if needed:
129
+
130
+ ```yaml
131
+ clamav:
132
+ image: clamav/clamav:stable
133
+ deploy:
134
+ resources:
135
+ limits:
136
+ memory: 1g
137
+ ```
138
+
139
+ ---
140
+
141
+ ## Troubleshooting
142
+
143
+ ### clamd not ready on startup
144
+
145
+ **Symptom:** Application starts before clamd is accepting connections; first scan fails with connection refused.
146
+
147
+ **Fix:** Add `depends_on` with `condition: service_healthy` (see example above) and ensure the `healthcheck` is configured on the clamav service. The `start_period` must be long enough for the initial database download.
148
+
149
+ ### Connection refused
150
+
151
+ **Symptom:** `ECONNREFUSED 127.0.0.1:3310`
152
+
153
+ **Causes and fixes:**
154
+
155
+ 1. clamd container is not running — `docker compose ps` to check.
156
+ 2. Wrong host — if the app is inside Docker, use the service name (`clamav`), not `127.0.0.1`.
157
+ 3. Port not published (only matters when the app runs on the host, outside the Compose network): verify the `ports` mapping in `docker-compose.yml`.
158
+ 4. clamd is still loading the virus database — add the `healthcheck` and `depends_on` described above.
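
To tell these causes apart from inside the app container, it can help to probe clamd directly. The sketch below uses Node's built-in `net` module to send clamd's null-terminated `zPING` command and read the reply. `pingClamd` is a hypothetical helper, not a pompelmi function, and the default host, port, and timeout are assumptions.

```javascript
const net = require('net');

// Sends zPING to clamd and resolves with the reply text ("PONG" when healthy).
// Rejects on connection errors such as ECONNREFUSED, or when no reply arrives in time.
function pingClamd({ host = '127.0.0.1', port = 3310, timeoutMs = 2000 } = {}) {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port });
    let reply = '';

    socket.setTimeout(timeoutMs);
    socket.on('connect', () => socket.write('zPING\0'));
    socket.on('data', (chunk) => {
      reply += chunk.toString('utf8');
      // z-prefixed commands get a null-terminated reply.
      if (reply.includes('\0')) socket.end();
    });
    socket.on('timeout', () => {
      socket.destroy(new Error(`no reply from clamd within ${timeoutMs}ms`));
    });
    socket.on('error', reject);
    socket.on('close', () => resolve(reply.replace(/\0/g, '').trim()));
  });
}

// Usage: pingClamd({ host: 'clamav' }).then(console.log, console.error);
```

An `ECONNREFUSED` rejection points at causes 1 to 3, while a timeout with an open connection suggests clamd is running but not yet ready.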
159
+
160
+ ### Timeout errors
161
+
162
+ **Symptom:** `clamd connection timed out after 15000ms`
163
+
164
+ **Fixes:**
165
+
166
+ 1. Increase `timeout` in the options object (e.g. `timeout: 60_000`).
167
+ 2. Check clamd resource limits — if it is CPU- or memory-constrained it will scan slowly.
168
+ 3. Check network latency between app and clamav containers.
169
+
170
+ ### Virus definitions out of date
171
+
172
+ The official image runs `freshclam` periodically. If you see scan errors mentioning outdated definitions, exec into the container and run it manually:
173
+
174
+ ```bash
175
+ docker compose exec clamav freshclam
176
+ ```
177
+
178
+ Or restart the container; `freshclam` runs at startup.