npm - spamscanner - Versions diffs - 3.0.6 → 5.0.0 - Mend

spamscanner 3.0.6 → 5.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/README.md CHANGED

@@ -2,9 +2,7 @@
   <a href="https://spamscanner.net"><img src="https://d1i8ikybhfrv4r.cloudfront.net/spamscanner.png" alt="spamscanner" /></a>
 </h1>
 <div align="center">
-  <a href="https://join.slack.com/t/ladjs/shared_invite/zt-fqei6z11-Bq2trhwHQxVc5x~ifiZG0g"><img src="https://img.shields.io/badge/chat-join%20slack-brightgreen" alt="chat" /></a>
-  <a href="https://travis-ci.com/spamscanner/spamscanner"><img src="https://travis-ci.com/spamscanner/spamscanner.svg?branch=master" alt="build status" /></a>
-  <a href="https://codecov.io/github/spamscanner/spamscanner"><img src="https://img.shields.io/codecov/c/github/spamscanner/spamscanner/master.svg" alt="code coverage" /></a>
+  <a href="https://github.com/spamscanner/spamscanner/actions/workflows/ci.yml"><img src="https://github.com/spamscanner/spamscanner/actions/workflows/ci.yml/badge.svg" alt="build status" /></a>
   <a href="https://github.com/sindresorhus/xo"><img src="https://img.shields.io/badge/code_style-XO-5ed9c7.svg" alt="code style" /></a>
   <a href="https://github.com/prettier/prettier"><img src="https://img.shields.io/badge/styled_with-prettier-ff69b4.svg" alt="styled with prettier" /></a>
   <a href="https://lass.js.org"><img src="https://img.shields.io/badge/made_with-lass-95CC28.svg" alt="made with lass" /></a>
@@ -48,6 +46,7 @@
   * [`scanner.getVirusResults(mail)`](#scannergetvirusresultsmail)
   * [`scanner.parseLocale(locale)`](#scannerparselocalelocale)
 * [Caching](#caching)
+* [Debugging](#debugging)
 * [Contributors](#contributors)
 * [References](#references)
 * [License](#license)
@@ -188,11 +187,48 @@ Note that you can simply use the Spam Scanner API for free at <https://spamscann
 2. Configure ClamAV:
    ```sh
+   # if you are on Intel macOS
+   sudo mv /usr/local/etc/clamav/clamd.conf.sample /usr/local/etc/clamav/clamd.conf
+   # if you are on M1 macOS (or newer brew which installs to `/opt/homebrew`)
+   sudo mv /opt/homebrew/etc/clamav/clamd.conf.sample /opt/homebrew/etc/clamav/clamd.conf
+   ```
+   ```sh
+   # if you are on Intel macOS
+   sudo vim /usr/local/etc/clamav/clamd.conf
+   # if you are on M1 macOS (or newer brew which installs to `/opt/homebrew`)
+   sudo vim /opt/homebrew/etc/clamav/clamd.conf
+   ```
+   ```diff
+   -Example
+   +#Example
+   -#StreamMaxLength 10M
+   +StreamMaxLength 50M
+   +# this file path may be different on your OS (that's OK)
+   \-#LocalSocket /tmp/clamd.socket
+   \+LocalSocket /tmp/clamd.socket
+   ```
+   ```sh
+   # if you are on Intel macOS
    sudo mv /usr/local/etc/clamav/freshclam.conf.sample /usr/local/etc/clamav/freshclam.conf
+   # if you are on M1 macOS (or newer brew which installs to `/opt/homebrew`)
+   sudo mv /opt/homebrew/etc/clamav/freshclam.conf.sample /opt/homebrew/etc/clamav/freshclam.conf
    ```
    ```sh
+   # if you are on Intel macOS
    sudo vim /usr/local/etc/clamav/freshclam.conf
+   # if you are on M1 macOS (or newer brew which installs to `/opt/homebrew`)
+   sudo vim /opt/homebrew/etc/clamav/freshclam.conf
    ```
    ```diff
@@ -210,6 +246,8 @@ Note that you can simply use the Spam Scanner API for free at <https://spamscann
    sudo vim /Library/LaunchDaemons/org.clamav.clamd.plist
    ```
+   > If you are on Intel macOS:
    ```plist
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
@@ -231,12 +269,37 @@ Note that you can simply use the Spam Scanner API for free at <https://spamscann
    </plist>
    ```
+   > If you are on M1 macOS (or newer brew which installs to `/opt/homebrew`)
+   ```plist
+   <?xml version="1.0" encoding="UTF-8"?>
+   <!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+   <plist version="1.0">
+   <dict>
+     <key>Label</key>
+     <string>org.clamav.clamd</string>
+     <key>KeepAlive</key>
+     <true/>
+     <key>Program</key>
+     <string>/opt/homebrew/sbin/clamd</string>
+     <key>ProgramArguments</key>
+     <array>
+       <string>clamd</string>
+     </array>
+     <key>RunAtLoad</key>
+     <true/>
+   </dict>
+   </plist>
+   ```
+4. Enable it and start it on boot:
    ```sh
    sudo launchctl load /Library/LaunchDaemons/org.clamav.clamd.plist
    sudo launchctl start /Library/LaunchDaemons/org.clamav.clamd.plist
    ```
-4. You may want to periodically run `freshclam` to update the config, or configure a similar `plist` configuration for `launchctl`.
+5. You may want to periodically run `freshclam` to update the config, or configure a similar `plist` configuration for `launchctl`.
 ## Install
@@ -244,7 +307,7 @@ Note that you can simply use the Spam Scanner API for free at <https://spamscann
 [npm][]:
 ```sh
-npm install spamscanner node-snowball
+npm install spamscanner
 ```
@@ -359,7 +422,7 @@ Currently Spam Scanner supports the following locales for tokenization, stemming
 | Finnish    | `fn`       |
 | Farsi      | `fa`       |
 | French     | `fr`       |
-| German     | `gr`       |
+| German     | `de`       |
 | Hungarian  | `hr`       |
 | Indonesian | `in`       |
 | Italian    | `it`       |
@@ -406,7 +469,7 @@ A common example of this is a link of `рaypal.com` which when converted to ASCI
 This method checks against [Cloudflare for Families](https://developers.cloudflare.com/1.1.1.1/1.1.1.1-for-families) servers for both adult-related content, malware, and phishing.  This means we do two separate DNS over HTTPS requests to `1.1.1.2` for malware and `1.1.1.3` for adult-related content.  You can parse the messages results Array for messages that contain "adult-related content" if you need to parse whether or not you want to flag for adult-related content or not on your application.
-If you are using Cloudflare for Families DNS servers as mentioned in [Requirements](#requirements)), then if there are any HTTPS over DNS request errors, it will fallback to use the DNS servers set on the system for lookups, which would in turn use Cloudflare for Family DNS. (using DNS over HTTPS with a fallback of [dns.resolve4](https://nodejs.org/api/dns.html#dns_dns_resolve4\_hostname_options_callback)) – and if it returns `0.0.0.0` then it is considered to be phishing.
+If you are using Cloudflare for Families DNS servers as mentioned in [Requirements](#requirements)), then if there are any HTTPS over DNS request errors, it will fallback to use the DNS servers set on the system for lookups, which would in turn use Cloudflare for Family DNS. (using DNS over HTTPS with a fallback of [dns.resolve4](https://nodejs.org/api/dns.html#dns_dns_resolve4_hostname_options_callback)) – and if it returns `0.0.0.0` then it is considered to be phishing.
 We actually helped Cloudflare in August 2020 to update their documentation to note that this result of `0.0.0.0` is returned for maliciously found content on FQDN and IP lookups.
@@ -501,6 +564,13 @@ const scanner = new SpamScanner({
 Note that in [Forward Email][forward-email] we use the `client` approach as we have multiple threads across multiple servers running, and in-memory caching would not be efficient.
+## Debugging
+Spam Scanner has built-in debug output via `util.debuglog('spamscanner')`.
+This means you can run your app with `NODE_DEBUG=spamscanner node app.js` to get useful debug output to your console.
 ## Contributors
 | Name             | Website                    |
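
Note on the new Debugging section: the README hunk above documents `util.debuglog('spamscanner')` and the `NODE_DEBUG=spamscanner` switch. A minimal sketch of how that pattern behaves is shown here; `checkMessage` is a hypothetical helper used only for illustration and is not part of the package.

```javascript
const { debuglog } = require('node:util');

// The returned function writes to stderr only when the NODE_DEBUG
// environment variable includes "spamscanner"; otherwise it is a no-op.
const debug = debuglog('spamscanner');

// Hypothetical helper (not from the package): splits text and logs the result.
function checkMessage(text) {
  const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
  debug('tokens %o', tokens);
  return tokens.length;
}

checkMessage('Hello from the scanner');
```

Running the file as `NODE_DEBUG=spamscanner node app.js` should print the debug line; without the variable the call stays silent.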

package/index.js CHANGED

@@ -1,8 +1,9 @@
+const process = require('process');
 const dns = require('dns');
 const fs = require('fs');
-const { promisify } = require('util');
+const { debuglog } = require('util');
-// eslint-disable-next-line node/no-deprecated-api
+// eslint-disable-next-line n/no-deprecated-api
 const punycode = require('punycode');
 const ClamScan = require('clamscan');
@@ -12,7 +13,6 @@ const RE2 = require('re2');
 const bitcoinRegex = require('bitcoin-regex');
 const contractions = require('expand-contractions');
 const creditCardRegex = require('credit-card-regex');
-const debug = require('debug')('spamscanner');
 const emailRegexSafe = require('email-regex-safe');
 const emojiPatterns = require('emoji-patterns');
 const escapeStringRegexp = require('escape-string-regexp');
@@ -46,12 +46,15 @@ const toEmoji = require('gemoji/name-to-emoji');
 const universalify = require('universalify');
 const urlRegexSafe = require('url-regex-safe');
 const validator = require('validator');
+const which = require('which');
 const { Iconv } = require('iconv');
 const { codes } = require('currency-codes');
 const { fromUrl, NO_HOSTNAME } = require('parse-domain');
 const { parse } = require('node-html-parser');
 const { simpleParser } = require('mailparser');
+const debug = debuglog('spamscanner');
 const aggressiveTokenizer = new natural.AggressiveTokenizer();
 const orthographyTokenizer = new natural.OrthographyTokenizer({
   language: 'fi'
@@ -69,20 +72,115 @@ const aggressiveTokenizerSv = new natural.AggressiveTokenizerSv();
 const aggressiveTokenizerRu = new natural.AggressiveTokenizerRu();
 const aggressiveTokenizerVi = new natural.AggressiveTokenizerVi();
-const stopwordsEn = require('natural/lib/natural/util/stopwords').words;
-const stopwordsEs = require('natural/lib/natural/util/stopwords_es').words;
-const stopwordsFa = require('natural/lib/natural/util/stopwords_fa').words;
-const stopwordsFr = require('natural/lib/natural/util/stopwords_fr').words;
-const stopwordsId = require('natural/lib/natural/util/stopwords_id').words;
-const stopwordsJa = require('natural/lib/natural/util/stopwords_ja').words;
-const stopwordsIt = require('natural/lib/natural/util/stopwords_it').words;
-const stopwordsNl = require('natural/lib/natural/util/stopwords_nl').words;
-const stopwordsNo = require('natural/lib/natural/util/stopwords_no').words;
-const stopwordsPl = require('natural/lib/natural/util/stopwords_pl').words;
-const stopwordsPt = require('natural/lib/natural/util/stopwords_pt').words;
-const stopwordsRu = require('natural/lib/natural/util/stopwords_ru').words;
-const stopwordsSv = require('natural/lib/natural/util/stopwords_sv').words;
-const stopwordsZh = require('natural/lib/natural/util/stopwords_zh').words;
+const stopwordsEn = new Set([
+  ...require('natural/lib/natural/util/stopwords').words,
+  ...sw.eng
+]);
+const stopwordsEs = new Set([
+  ...require('natural/lib/natural/util/stopwords_es').words,
+  ...sw.spa
+]);
+const stopwordsFa = new Set([
+  ...require('natural/lib/natural/util/stopwords_fa').words,
+  ...sw.fas
+]);
+const stopwordsFr = new Set([
+  ...require('natural/lib/natural/util/stopwords_fr').words,
+  ...sw.fra
+]);
+const stopwordsId = new Set([
+  ...require('natural/lib/natural/util/stopwords_id').words,
+  ...sw.ind
+]);
+const stopwordsJa = new Set([
+  ...require('natural/lib/natural/util/stopwords_ja').words,
+  ...sw.jpn
+]);
+const stopwordsIt = new Set([
+  ...require('natural/lib/natural/util/stopwords_it').words,
+  ...sw.ita
+]);
+const stopwordsNl = new Set([
+  ...require('natural/lib/natural/util/stopwords_nl').words,
+  ...sw.nld
+]);
+const stopwordsNo = new Set([
+  ...require('natural/lib/natural/util/stopwords_no').words,
+  ...sw.nob
+]);
+const stopwordsPl = new Set([
+  ...require('natural/lib/natural/util/stopwords_pl').words,
+  ...sw.pol
+]);
+const stopwordsPt = new Set([
+  ...require('natural/lib/natural/util/stopwords_pt').words,
+  ...sw.por,
+  ...sw.porBr
+]);
+const stopwordsRu = new Set([
+  ...require('natural/lib/natural/util/stopwords_ru').words,
+  ...sw.rus
+]);
+const stopwordsSv = new Set([
+  ...require('natural/lib/natural/util/stopwords_sv').words,
+  ...sw.swe
+]);
+const stopwordsZh = new Set([
+  ...require('natural/lib/natural/util/stopwords_zh').words,
+  ...sw.zho
+]);
+const stopwordsRon = new Set(sw.ron);
+const stopwordsTur = new Set(sw.tur);
+const stopwordsVie = new Set(sw.vie);
+const stopwordsDeu = new Set(sw.deu);
+const stopwordsHun = new Set(sw.hun);
+const stopwordsAra = new Set(sw.ara);
+const stopwordsDan = new Set(sw.dan);
+const stopwordsFin = new Set(sw.fin);
+// TODO: add stopword pairing for these langs:
+// afr
+// ben
+// bre
+// bul
+// cat
+// ces
+// ell
+// epo
+// est
+// eus
+// fra
+// gle
+// glg
+// guj
+// hau
+// heb
+// hin
+// hrv
+// hye
+// kor
+// kur
+// lat
+// lav
+// lgg
+// lggNd
+// lit
+// mar
+// msa
+// mya
+// panGu
+// slk
+// slv
+// som
+// sot
+// swa
+// tgl
+// tha
+// ukr
+// urd
+// yor
+// zul
 // <https://stackoverflow.com/a/41353282>
 // <https://www.ietf.org/rfc/rfc3986.txt>
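
Note on the hunk above: each stopword list is now a `Set` built by spreading two arrays together. The sketch below illustrates that pattern in isolation. The word lists and the `removeStopwords` helper are made up for the example; the real lists come from the `natural` and `sw` imports shown in the diff.

```javascript
// Hypothetical word lists standing in for the two real sources.
const naturalWords = ['the', 'and', 'of'];
const extraWords = ['and', 'or', 'the'];

// Spreading both arrays into one Set drops duplicates at construction time.
const stopwords = new Set([...naturalWords, ...extraWords]);

// Hypothetical helper: keep only tokens that are not stopwords.
// Set.prototype.has avoids the linear scan that Array.prototype.includes does.
function removeStopwords(tokens) {
  return tokens.filter((token) => !stopwords.has(token));
}

console.log(stopwords.size); // 4
console.log(removeStopwords(['the', 'price', 'of', 'gold'])); // [ 'price', 'gold' ]
```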
@@ -92,19 +190,18 @@ const ENDING_RESERVED_REGEX = new RE2(
 const PKG = require('./package.json');
-const VOCABULARY_LIMIT = require('./vocabulary-limit');
+const VOCABULARY_LIMIT = require('./vocabulary-limit.js');
-const ISO_CODE_MAPPING = require('./iso-code-mapping');
+// TODO: convert this into a Map
+const ISO_CODE_MAPPING = require('./iso-code-mapping.json');
 // <https://kb.smarshmail.com/Article/23567>
-const EXECUTABLES = require('./executables');
+const EXECUTABLES = new Set(require('./executables.json'));
-const REPLACEMENT_WORDS = require('./replacement-words');
+const REPLACEMENT_WORDS = require('./replacement-words.json');
 const locales = new Set(i18nLocales.map((l) => l.toLowerCase()));
-const readFile = promisify(fs.readFile);
 const normalizeUrlOptions = {
   stripProtocol: true,
   stripWWW: false,
@@ -154,7 +251,8 @@ for (const code of codes()) {
   const symbol = getSymbolFromCurrency(code);
   if (
     typeof symbol === 'string' &&
-    !currencySymbols.includes(symbol) &&
+    // eslint-disable-next-line unicorn/prefer-includes
+    currencySymbols.indexOf(symbol) === -1 &&
     !new RE2(/^[a-z]+$/i).test(symbol)
   )
     currencySymbols.push(escapeStringRegexp(symbol));
@@ -187,11 +285,13 @@ const isURLOptions = {
 class SpamScanner {
   constructor(config = {}) {
     this.config = {
-      debug: process.env.NODE_ENV === 'test',
+      debug:
+        process.env.NODE_ENV === 'test' ||
+        process.env.NODE_ENV === 'development',
       checkIDNHomographAttack: false,
       // note that if you attempt to train an existing `scanner.classifier`
       // then you will need to re-use these, so we suggest you store them
-      replacements: config.replacements || require('./replacements'),
+      replacements: config.replacements || require('./replacements.js'),
       // <https://nodemailer.com/extras/mailparser/>
       // NOTE: `iconv` package's Iconv cannot be used in worker threads
       // AND it can not also be shared in worker threads either (e.g. cloned)
@@ -203,7 +303,7 @@ class SpamScanner {
       // `wget --mirror --passive-ftp ftp://ftp.ietf.org/ietf-mail-archive`
       // `wget --mirror --passive-ftp ftp://ftp.ietf.org/concluded-wg-ietf-mail-archive`
       // (spam dataset is private at the moment)
-      classifier: config.classifier || require('./get-classifier'),
+      classifier: config.classifier || require('./get-classifier.js'),
       // default locale validated against i18n-locales
       locale: 'en',
       // we recommend to use axe/cabin, see https://cabinjs.com
@@ -310,10 +410,17 @@ class SpamScanner {
         allowedAttributes: false
       },
       userAgent: `${PKG.name}/${PKG.version}`,
-      timeout: ms('5s'),
+      timeout: ms('10s'),
       clamscan: {
+        debugMode:
+          process.env.NODE_ENV === 'test' ||
+          process.env.NODE_ENV === 'development',
+        clamscan: {
+          path: which.sync('clamscan', { nothrow: true })
+        },
         clamdscan: {
           timeout: ms('10s'),
+          path: which.sync('clamdscan', { nothrow: true }),
           socket: macosVersion.isMacOS
             ? '/tmp/clamd.socket'
             : '/var/run/clamav/clamd.ctl'
@@ -416,7 +523,6 @@ class SpamScanner {
         // cache in the background
         this.config.client
           .set(key, `${isAdult}:${isMalware}`, 'PX', this.config.ttlMs)
-          // eslint-disable-next-line promise/prefer-await-to-then
           .then(this.config.logger.info)
           .catch(this.config.logger.error);
         return { isAdult, isMalware };
@@ -431,6 +537,27 @@ class SpamScanner {
       throw new Error(
         `Locale of ${this.config.locale} was not valid according to locales list.`
       );
+    //
+    // set up regex helpers
+    //
+    this.EMAIL_REPLACEMENT_REGEX = new RE2(this.config.replacements.email, 'g');
+    const replacementRegexes = [];
+    for (const key of Object.keys(this.config.replacements)) {
+      replacementRegexes.push(
+        escapeStringRegexp(this.config.replacements[key])
+      );
+    }
+    this.REPLACEMENTS_REGEX = new RE2(
+      new RegExp(replacementRegexes.join('|'), 'g')
+    );
+    //
+    // set up helper Map and Sets for fast lookup
+    // (Set.has is 2x faster than includes, and 50% faster than indexOf)
+    //
+    this.WHITELISTED_WORDS = new Set(Object.values(this.config.replacements));
   }
   getHostname(link) {
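
Note on the hunk above: the replacement regexes move from `getTokens` into the constructor, so they are compiled once per instance instead of on every call. A rough sketch of the same idea follows, using the built-in `RegExp` rather than `RE2` and a local `escapeRegExp` as a stand-in for the `escape-string-regexp` package. The `Scrubber` class and the token values are hypothetical.

```javascript
// Hypothetical replacement tokens; the real values come from the package config.
const replacements = {
  email: 'EMAIL_abc123',
  url: 'URL_def456',
  currency: 'CUR$789'
};

// Local stand-in for `escape-string-regexp` (an assumption for this sketch).
function escapeRegExp(string) {
  return string.replace(/[|\\{}()[\]^$+*?.]/g, '\\$&');
}

class Scrubber {
  constructor(values) {
    // Compiled once here rather than on every strip() call.
    this.pattern = new RegExp(
      Object.values(values).map(escapeRegExp).join('|'),
      'g'
    );
  }

  // Replace any injected replacement token with a space.
  strip(text) {
    return text.replace(this.pattern, ' ');
  }
}

const scrubber = new Scrubber(replacements);
console.log(scrubber.strip('pay CUR$789 now'));
```

Escaping matters here: without it, the `$` in a token such as `CUR$789` would be read as an end-of-input anchor and the token would never match.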
@@ -520,17 +647,12 @@ class SpamScanner {
             const stream = isStream(attachment.content)
               ? attachment.content
               : intoStream(attachment.content);
-            const {
-              is_infected: isInfected,
-              viruses
-            } = await clamscan.scan_stream(stream);
+            const { isInfected, viruses } = await clamscan.scanStream(stream);
             const name = isSANB(attachment.filename)
               ? `"${attachment.filename}"`
               : `#${i + 1}`;
             if (isInfected)
-              messages.push(
-                `Attachment ${name} was infected with "${viruses}".`
-              );
+              messages.push(`Attachment ${name} was infected with ${viruses}.`);
           } catch (err) {
             this.config.logger.error(err);
           }
@@ -548,13 +670,16 @@ class SpamScanner {
     let gtube = false;
-    if (isSANB(mail.html) && mail.html.includes(GTUBE)) gtube = true;
+    // eslint-disable-next-line unicorn/prefer-includes
+    if (isSANB(mail.html) && mail.html.indexOf(GTUBE) !== -1) gtube = true;
-    if (isSANB(mail.text) && !gtube && mail.text.includes(GTUBE)) gtube = true;
+    // eslint-disable-next-line unicorn/prefer-includes
+    if (isSANB(mail.text) && !gtube && mail.text.indexOf(GTUBE) !== -1)
+      gtube = true;
     if (gtube)
       messages.push(
-        'Message detected to contain the GTUBE test from <https://spamassassin.apache.org/gtube/>.'
+        'Message detected to contain the GTUBE test from https://spamassassin.apache.org/gtube/.'
       );
     return messages;
@@ -597,9 +722,8 @@ class SpamScanner {
           records[0] === '0.0.0.0'
         );
       } catch (err) {
-        this.config.logger.error(err);
-        // return true if there is an error with DNS lookups
-        return true;
+        this.config.logger.warn(err);
+        return false;
       }
     }
   }
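
Note on the hunk above: a failed DNS lookup is now logged with `warn` and treated as "not flagged" (`return false`), where 3.0.6 logged an error and returned `true`. The sketch below shows that fail-open behaviour under stated assumptions: `isBlocked` is a hypothetical function, the resolver is injected so the example runs without network access, and the single-record check is a guess at the condition cut off by the hunk.

```javascript
// `resolve4` is injected; in practice it could be
// require('node:dns').promises.resolve4 (an assumption, not the package's code).
async function isBlocked(hostname, resolve4, logger = console) {
  try {
    const records = await resolve4(hostname);
    // A lone 0.0.0.0 answer is what the README describes as a malicious result.
    return records.length === 1 && records[0] === '0.0.0.0';
  } catch (err) {
    // Fail open: a lookup error no longer marks the hostname as malicious.
    logger.warn(err);
    return false;
  }
}

// Usage with stub resolvers instead of real DNS queries.
const failing = async () => {
  throw new Error('ETIMEOUT');
};
isBlocked('example.com', failing, { warn() {} }).then((result) => {
  console.log(result); // false
});
```

The trade-off is fewer false positives during DNS outages at the cost of letting a genuinely malicious hostname through while lookups are failing.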
@@ -621,8 +745,6 @@ class SpamScanner {
     //
     // However we don't recommend this and therefore have our servers set to standard Cloudflare DNS
     //
-    // TODO: we need to do two lookups in parallel, one against adult and one against malware
-    //       and also make sure the messages aren't duplicated when we concatenate final array of messages
     const [isAdult, isMalware] = await Promise.all([
       this.malwareLookup('https://family.cloudflare-dns.com/dns-query', name),
       this.malwareLookup('https://security.cloudflare-dns.com/dns-query', name)
@@ -744,14 +866,14 @@ class SpamScanner {
         })
         .match(URL_REGEX) || [];
-    const array = [];
+    const array = new Set();
     for (const url of urls) {
       const normalized = this.getNormalizedUrl(url);
-      if (normalized && !array.includes(normalized)) array.push(normalized);
+      if (normalized) array.add(normalized);
     }
-    return array;
+    return [...array];
   }
   parseLocale(locale) {
@@ -765,12 +887,6 @@ class SpamScanner {
   // <https://github.com/NaturalNode/natural#stemmers>
   // eslint-disable-next-line complexity
   async getTokens(string, locale, isHTML = false) {
-    // get the current email replacement regex
-    const EMAIL_REPLACEMENT_REGEX = new RE2(
-      this.config.replacements.email,
-      'g'
-    );
     //
     // parse HTML for <html> tag with lang attr
     // otherwise if that wasn't found then look for this
@@ -818,17 +934,6 @@ class SpamScanner {
     if (isHTML) string = sanitizeHtml(string, this.config.sanitizeHtml);
-    const replacementRegexes = [];
-    for (const key of Object.keys(this.config.replacements)) {
-      replacementRegexes.push(
-        escapeStringRegexp(this.config.replacements[key])
-      );
-    }
-    const REPLACEMENTS_REGEX = new RE2(
-      new RegExp(replacementRegexes.join('|'), 'g')
-    );
     string = striptags(string, [], ' ')
       .trim()
       // replace newlines
@@ -837,7 +942,7 @@ class SpamScanner {
       // attackers may try to inject our replacements into the message
       // therefore we should strip all of them before doing any replacements
       //
-      .replace(REPLACEMENTS_REGEX, ' ');
+      .replace(this.REPLACEMENTS_REGEX, ' ');
     //
     // we should instead use language detection to determine
@@ -855,7 +960,8 @@ class SpamScanner {
     locale = this.parseLocale(isSANB(locale) ? locale : this.config.locale);
-    if (!locales.has(locale)) {
+    // NOTE: "in" and "po" are valid locales but not from i18n
+    if (!locales.has(locale) && locale !== 'in' && locale !== 'po') {
       debug(`Locale ${locale} was not valid and will use default`);
       locale = this.parseLocale(this.config.locale);
     }
@@ -867,103 +973,145 @@ class SpamScanner {
     let stopwords = stopwordsEn;
     let language = 'english';
     let stemword = 'default';
     switch (locale) {
       case 'ar':
+        // arb
+        // ISO 639-3 = ara
+        stopwords = stopwordsAra;
         language = 'arabic';
         break;
       case 'da':
+        // dan
         language = 'danish';
+        stopwords = stopwordsDan;
         break;
       case 'nl':
+        // nld
         stopwords = stopwordsNl;
         language = 'dutch';
         break;
       case 'en':
+        // eng
         language = 'english';
         break;
       case 'fi':
+        // fin
         language = 'finnish';
         tokenizer = orthographyTokenizer;
+        stopwords = stopwordsFin;
         break;
       case 'fa':
+        // fas (Persian/Farsi)
         language = 'farsi';
         tokenizer = aggressiveTokenizerFa;
         stopwords = stopwordsFa;
         stemword = natural.PorterStemmerFa.stem.bind(natural.PorterStemmerFa);
         break;
       case 'fr':
+        // fra
         language = 'french';
         tokenizer = aggressiveTokenizerFr;
         stopwords = stopwordsFr;
         break;
       case 'de':
+        // deu
         language = 'german';
+        stopwords = stopwordsDeu;
         break;
       case 'hu':
+        // hun
         language = 'hungarian';
+        stopwords = stopwordsHun;
         break;
       case 'in':
+        // ind
         language = 'indonesian';
         tokenizer = aggressiveTokenizerId;
         stopwords = stopwordsId;
         break;
       case 'it':
+        // ita
         language = 'italian';
         tokenizer = aggressiveTokenizerIt;
         stopwords = stopwordsIt;
         break;
       case 'ja':
+        // jpn
         tokenizer = tokenizerJa;
         stopwords = stopwordsJa;
         stemword = natural.StemmerJa.stem.bind(natural.StemmerJa);
         break;
       case 'nb':
+        // nob
+        language = 'norwegian';
+        tokenizer = aggressiveTokenizerNo;
+        stopwords = stopwordsNo;
+        break;
       case 'nn':
+        // nno
+        // ISO 639-3 = nob
         language = 'norwegian';
         tokenizer = aggressiveTokenizerNo;
         stopwords = stopwordsNo;
         break;
       case 'po':
+        // pol
         language = 'polish';
         tokenizer = aggressiveTokenizerPl;
         stopwords = stopwordsPl;
         stemword = false;
         break;
       case 'pt':
+        // por
         language = 'portuguese';
         tokenizer = aggressiveTokenizerPt;
         stopwords = stopwordsPt;
         break;
       case 'es':
+        // spa
         language = 'spanish';
         tokenizer = aggressiveTokenizerEs;
         stopwords = stopwordsEs;
         break;
       case 'sv':
+        // swe
         language = 'swedish';
         tokenizer = aggressiveTokenizerSv;
         stopwords = stopwordsSv;
         break;
       case 'ro':
+        // ron
         language = 'romanian';
+        stopwords = stopwordsRon;
         break;
       case 'ru':
+        // rus
         language = 'russian';
         tokenizer = aggressiveTokenizerRu;
         stopwords = stopwordsRu;
         break;
       case 'ta':
+        // tam
+        // NOTE: no stopwords available
         language = 'tamil';
         break;
       case 'tr':
+        // tur
         language = 'turkish';
+        stopwords = stopwordsTur;
         break;
       case 'vi':
+        // vie
         language = 'vietnamese';
         tokenizer = aggressiveTokenizerVi;
+        stopwords = stopwordsVie;
         stemword = false;
         break;
       case 'zh':
+        // cmn
+        // TODO: use this instead https://github.com/yishn/chinese-tokenizer
+        // ISO 639-3 = zho (Chinese, Macrolanguage)
         language = 'chinese';
         stopwords = stopwordsZh;
         stemword = false;
@@ -981,7 +1129,7 @@ class SpamScanner {
       string
         .split(' ')
         .map((_string) =>
-          _string.startsWith(':') &&
+          _string.indexOf(':') === 0 &&
           _string.endsWith(':') &&
           typeof toEmoji[_string.slice(1, -1)] === 'string'
             ? toEmoji[_string.slice(1, -1)]
@@ -1029,7 +1177,10 @@ class SpamScanner {
         // now we ensure that URL's and EMAIL's are properly spaced out
         // (e.g. in case ?email=some@email.com was in a URL)
-        .replace(EMAIL_REPLACEMENT_REGEX, ` ${this.config.replacements.email} `)
+        .replace(
+          this.EMAIL_REPLACEMENT_REGEX,
+          ` ${this.config.replacements.email} `
+        )
         // TODO: replace file paths, file dirs, dotfiles, and dotdirs
@@ -1044,12 +1195,14 @@ class SpamScanner {
         // replace currency
         .replace(CURRENCY_REGEX, ` ${this.config.replacements.currency} `);
+    //
     // expand contractions so "they're" -> [ they, are ] vs. [ they, re ]
     // <https://github.com/NaturalNode/natural/issues/533>
-    if (locale === 'en') string = contractions.expand(string);
-    // whitelist exclusions
-    const whitelistedWords = Object.values(this.config.replacements);
+    //
+    // NOTE: we're doing this for all languages now, not just en
+    // if (locale === 'en')
+    //
+    string = contractions.expand(string);
     //
     // Future research:
@@ -1063,43 +1216,32 @@ class SpamScanner {
     for (const token of tokenizer.tokenize(string.toLowerCase())) {
       // whitelist words from being stemmed (safeguard)
       if (
-        whitelistedWords.includes(token) ||
-        token.startsWith(this.config.replacements.initialism) ||
-        token.startsWith(this.config.replacements.abbrevation)
+        this.WHITELISTED_WORDS.has(token) ||
+        token.indexOf(this.config.replacements.initialism) === 0 ||
+        token.indexOf(this.config.replacements.abbrevation) === 0
       ) {
         tokens.push(token);
         continue;
       }
-      if (
-        stopwords.includes(token) ||
-        (sw[locale] && sw[locale].includes(token)) ||
-        (locale !== 'en' &&
-          (stopwordsEn.includes(token) || sw.en.includes(token)))
-      )
+      if (stopwords.has(token) || (locale !== 'en' && stopwordsEn.has(token))) {
         continue;
+      }
       // locale specific stopwords to ignore
       let localeStem;
       if (typeof stemword === 'function') {
         localeStem = stemword(token);
-        if (
-          localeStem &&
-          (stopwords.includes(localeStem) ||
-            (sw[locale] && sw[locale].includes(localeStem)))
-        )
+        if (localeStem && stopwords.has(localeStem)) {
           continue;
+        }
       }
       // always check against English stemwords
       let englishStem;
       if (locale !== 'en') {
         englishStem = snowball.stemword(token, 'english');
-        if (
-          englishStem &&
-          (stopwordsEn.includes(englishStem) || sw.en.includes(englishStem))
-        )
-          continue;
+        if (englishStem && stopwordsEn.has(englishStem)) continue;
       }
       tokens.push(
@@ -1107,6 +1249,8 @@ class SpamScanner {
       );
     }
+    debug('locale', locale, 'tokens', tokens);
     if (this.config.debug) return tokens;
     // we should sha256 all tokens with hasha if not in debug mode
@@ -1119,7 +1263,7 @@ class SpamScanner {
     let source = string;
     if (isBuffer(string)) source = string.toString();
     else if (typeof string === 'string' && isValidPath(string))
-      source = await readFile(string);
+      source = await fs.promises.readFile(string);
     const tokens = [];
     const mail = await simpleParser(source, this.config.simpleParser);
@@ -1157,12 +1301,11 @@ class SpamScanner {
   // eslint-disable-next-line complexity
   async getPhishingResults(mail) {
-    const messages = [];
+    const messages = new Set();
     //
     // NOTE: all links pushed are lowercased
     //
-    const links = [];
+    const links = new Set();
     // parse <a> tags with different org domain in text vs the link
     if (isSANB(mail.html)) {
@@ -1172,7 +1315,7 @@ class SpamScanner {
       // elements concatenate to form a URL which is malicious or phishing
       //
       for (const link of this.getUrls(striptags(mail.html, [], ' ').trim())) {
-        if (!links.includes(link)) links.push(link);
+        links.add(link);
       }
       //
@@ -1214,7 +1357,7 @@ class SpamScanner {
             // (this is needed because some have "Web:%20http://google.com" for example in href tags)
             [href] = this.getUrls(href);
             // eslint-disable-next-line max-depth
-            if (href && !links.includes(href)) links.push(href);
+            if (href) links.add(href);
           }
           // the text content could contain multiple URL's
@@ -1224,18 +1367,17 @@ class SpamScanner {
             isSANB(href) &&
             validator.isURL(href, isURLOptions)
           ) {
-            const string = `Anchor link with href of "${href}" and inner text value of "${textContent}"`;
+            const string = `Anchor link with href of ${href} and inner text value of "${textContent}"`;
             // eslint-disable-next-line max-depth
             if (this.config.checkIDNHomographAttack) {
               const anchorUrlHostname = this.getHostname(href);
               // eslint-disable-next-line max-depth
               if (anchorUrlHostname) {
-                const anchorUrlHostnameToASCII = punycode.toASCII(
-                  anchorUrlHostname
-                );
+                const anchorUrlHostnameToASCII =
+                  punycode.toASCII(anchorUrlHostname);
                 // eslint-disable-next-line max-depth
-                if (anchorUrlHostnameToASCII.startsWith('xn--'))
-                  messages.push(
+                if (anchorUrlHostnameToASCII.indexOf('xn--') === 0)
+                  messages.add(
                     `${string} has possible IDN homograph attack from anchor hostname.`
                   );
               }
@@ -1244,20 +1386,19 @@ class SpamScanner {
             // eslint-disable-next-line max-depth
             for (const link of this.getUrls(textContent)) {
               // this link should have already been included but just in case
-              // eslint-disable-next-line max-depth
-              if (!links.includes(link)) links.push(link);
+              links.add(link);
               // eslint-disable-next-line max-depth
               if (this.config.checkIDNHomographAttack) {
                 const innerTextUrlHostname = this.getHostname(link);
                 // eslint-disable-next-line max-depth
                 if (innerTextUrlHostname) {
-                  const innerTextUrlHostnameToASCII = punycode.toASCII(
-                    innerTextUrlHostname
-                  );
+                  const innerTextUrlHostnameToASCII =
+                    punycode.toASCII(innerTextUrlHostname);
                   // eslint-disable-next-line max-depth
-                  if (innerTextUrlHostnameToASCII.startsWith('xn--'))
-                    messages.push(
+                  if (innerTextUrlHostnameToASCII.indexOf('xn--') === 0)
+                    messages.add(
                       `${string} has possible IDN homograph attack from inner text hostname.`
                     );
                 }
@@ -1273,49 +1414,46 @@ class SpamScanner {
     for (const prop of MAIL_PHISHING_PROPS) {
       if (isSANB(mail[prop])) {
         for (const link of this.getUrls(mail[prop])) {
-          if (!links.includes(link)) links.push(link);
+          links.add(link);
         }
       }
     }
-    for (const link of links) {
-      const urlHostname = this.getHostname(link);
-      if (urlHostname) {
-        const toASCII = punycode.toASCII(urlHostname);
-        if (toASCII.startsWith('xn--'))
-          messages.push(
-            `Possible IDN homograph attack from link of "${link}" with punycode converted hostname of "${toASCII}".`
-          );
+    if (this.config.checkIDNHomographAttack) {
+      for (const link of links) {
+        const urlHostname = this.getHostname(link);
+        if (urlHostname) {
+          const toASCII = punycode.toASCII(urlHostname);
+          if (toASCII.indexOf('xn--') === 0)
+            messages.add(
+              `Possible IDN homograph attack from link of ${link} with punycode converted hostname of ${toASCII}.`
+            );
+        }
       }
     }
     // check against Cloudflare malware/phishing/adult DNS lookup
     // if it returns `0.0.0.0` it means it was flagged
     await Promise.all(
-      links.map(async (link) => {
+      [...links].map(async (link) => {
         try {
           const urlHostname = this.getHostname(link);
           if (urlHostname) {
             const toASCII = punycode.toASCII(urlHostname);
-            const adultMessage = `Link hostname of "${toASCII}" was detected by Cloudflare's Family DNS to contain adult-related content, phishing, and/or malware.`;
-            const malwareMessage = `Link hostname of ${toASCII}" was detected by Cloudflare's Security DNS to contain phishing and/or malware.`;
+            const adultMessage = `Link hostname of ${toASCII} was detected by Cloudflare's Family DNS to contain adult-related content, phishing, and/or malware.`;
+            const malwareMessage = `Link hostname of ${toASCII} was detected by Cloudflare's Security DNS to contain phishing and/or malware.`;
             // if it already included both messages then return early
-            if (
-              messages.includes(adultMessage) &&
-              messages.includes(malwareMessage)
-            )
+            if (messages.has(adultMessage) && messages.has(malwareMessage))
               return;
-            const {
-              isAdult,
-              isMalware
-            } = await this.memoizedIsCloudflareBlocked(toASCII);
+            const { isAdult, isMalware } =
+              await this.memoizedIsCloudflareBlocked(toASCII);
-            if (isAdult && !messages.includes(adultMessage))
-              messages.push(adultMessage);
-            if (isMalware && !messages.includes(malwareMessage))
-              messages.push(malwareMessage);
+            if (isAdult && !messages.has(adultMessage))
+              messages.add(adultMessage);
+            if (isMalware && !messages.has(malwareMessage))
+              messages.add(malwareMessage);
           }
         } catch (err) {
           this.config.logger.error(err);
@@ -1323,7 +1461,7 @@ class SpamScanner {
       })
     );
-    return { messages, links };
+    return { messages: [...messages], links: [...links] };
   }
   // getNSFWResults() {
@@ -1344,7 +1482,7 @@ class SpamScanner {
           try {
             const fileType = await FileType.fromBuffer(attachment.content);
-            if (fileType && fileType.ext && EXECUTABLES.includes(fileType.ext))
+            if (fileType && fileType.ext && EXECUTABLES.has(fileType.ext))
               messages.push(
                 `Attachment's "magic number" indicated it was a dangerous executable with a ".${fileType.ext}" extension.`
               );
@@ -1359,7 +1497,7 @@ class SpamScanner {
             punycode.toUnicode(attachment.filename.split('?')[0])
           );
           const ext = fileExtension(filename);
-          if (ext && EXECUTABLES.includes(ext))
+          if (ext && EXECUTABLES.has(ext))
             messages.push(
               `Attachment's file name indicated it was a dangerous executable with a ".${ext}" extension.`
             );
@@ -1367,7 +1505,7 @@ class SpamScanner {
         if (isSANB(attachment.contentType)) {
           const ext = mime.extension(attachment.contentType);
-          if (isSANB(ext) && EXECUTABLES.includes(ext))
+          if (isSANB(ext) && EXECUTABLES.has(ext))
             messages.push(
               `Attachment's Content-Type was a dangerous executable with a ".${ext}" extension.`
             );
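
Note on the hunks above: `links` and `messages` move from arrays guarded by `includes()`/`push()` to `Set` instances using `add()`/`has()`, and are spread back into arrays only in the return value. A minimal sketch of that pattern follows. The function name `findIdnLinks` and the sample URLs are hypothetical and are not part of spamscanner. The sketch relies on the WHATWG `URL` parser for the ASCII hostname instead of the `punycode` and `getHostname` calls used in the package.

```javascript
// Hypothetical helper (not spamscanner's API): de-duplicates links with a Set
// and flags hostnames whose ASCII form starts with the punycode prefix "xn--".
function findIdnLinks(candidates) {
  const links = new Set();
  const messages = new Set();

  // Set#add ignores values that are already present, so no includes() guard is needed
  for (const link of candidates) links.add(link);

  for (const link of links) {
    let hostname;
    try {
      // the WHATWG URL parser returns the hostname in its ASCII (punycode) form
      hostname = new URL(link).hostname;
    } catch {
      continue; // skip strings that are not absolute URLs
    }

    // same prefix test as the diff; only the first label of the hostname is checked
    if (hostname.indexOf('xn--') === 0)
      messages.add(
        `Possible IDN homograph attack from link of ${link} with punycode converted hostname of ${hostname}.`
      );
  }

  // convert back to arrays so callers keep receiving the same shape as before
  return { messages: [...messages], links: [...links] };
}

const result = findIdnLinks([
  'http://español.com/',
  'http://español.com/',
  'https://example.com/'
]);
console.log(result.links.length, result.messages.length);
```

Because the prefix test only looks at the start of the hostname, a name such as `www.xn--….com` would not match in this sketch.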

package/package.json CHANGED

@@ -1,22 +1,12 @@
 {
   "name": "spamscanner",
   "description": "Spam Scanner - The Best Anti-Spam Scanning Service and Anti-Spam API",
-  "version": "3.0.6",
+  "version": "5.0.0",
   "author": "Niftylettuce, LLC. <niftylettuce@gmail.com> (https://niftylettuce.com/)",
-  "ava": {
-    "timeout": "30s",
-    "verbose": true,
-    "serial": true
-  },
   "bugs": {
     "url": "https://github.com/spamscanner/spamscanner/issues",
     "email": "niftylettuce@gmail.com"
   },
-  "commitlint": {
-    "extends": [
-      "@commitlint/config-conventional"
-    ]
-  },
   "contributors": [
     "Nick Baugh <niftylettuce@gmail.com> (http://niftylettuce.com/)",
     "Shaun Warman <shaunwarman1@gmail.com> (http://shaunwarman.com/)"
@@ -24,82 +14,81 @@
   "dependencies": {
     "@ladjs/naivebayes": "^0.1.0",
     "bitcoin-regex": "^2.0.0",
-    "clamscan": "^1.3.3",
+    "clamscan": "^2.1.2",
     "credit-card-regex": "^3.0.0",
-    "crypto-random-string": "^3.3.1",
+    "crypto-random-string": "3",
     "currency-codes": "^2.1.0",
-    "currency-symbol-map": "^5.0.1",
-    "debug": "^4.3.1",
+    "currency-symbol-map": "^5.1.0",
     "email-regex-safe": "^1.0.2",
-    "emoji-patterns": "^13.1.0",
-    "escape-string-regexp": "^4.0.0",
+    "emoji-patterns": "^14.0.1",
+    "escape-string-regexp": "4",
     "expand-contractions": "^1.0.1",
     "file-extension": "^4.0.5",
-    "file-type": "^16.2.0",
+    "file-type": "16",
     "floating-point-regex": "^0.1.0",
-    "franc": "^5.0.0",
-    "gemoji": "^6.1.0",
+    "franc": "5",
+    "gemoji": "6",
     "hasha": "^5.2.2",
     "hexa-color-regex": "^1.0.0",
-    "i18n-locales": "^0.0.4",
-    "iconv": "^3.0.0",
-    "into-stream": "^6.0.0",
-    "ip-regex": "^4.3.0",
+    "i18n-locales": "^0.0.5",
+    "iconv": "^3.0.1",
+    "into-stream": "6",
+    "ip-regex": "4",
     "is-buffer": "^2.0.5",
-    "is-stream": "^2.0.0",
+    "is-stream": "2",
     "is-string-and-not-blank": "^0.0.2",
     "is-valid-path": "^0.1.1",
     "mac-regex": "^1.0.0",
-    "macos-version": "^5.2.1",
-    "mailparser": "^3.0.1",
+    "macos-version": "5",
+    "mailparser": "^3.5.0",
     "memoizee": "^0.4.15",
-    "mime-types": "^2.1.28",
+    "mime-types": "^2.1.35",
     "ms": "^2.1.3",
-    "natural": "^4.0.0",
+    "natural": "^5.2.2",
     "newline-remove": "^1.0.2",
-    "node-html-parser": "^2.1.0",
+    "node-html-parser": "4",
     "node-snowball": "^0.6.0",
-    "normalize-url": "^5.3.0",
-    "parse-domain": "^3.0.3",
+    "normalize-url": "5",
+    "parse-domain": "5",
     "phone-regex": "^2.1.0",
     "punycode": "^2.1.1",
-    "re2": "^1.15.9",
-    "sanitize-html": "^2.3.2",
-    "stopword": "^1.0.6",
-    "striptags": "^3.1.1",
-    "superagent": "^6.1.0",
+    "re2": "^1.17.6",
+    "sanitize-html": "^2.7.0",
+    "stopword": "^2.0.2",
+    "striptags": "^3.2.0",
+    "superagent": "^7.1.6",
     "trim-leading-whitespace": "^0.1.1",
     "universalify": "^2.0.0",
-    "url-regex-safe": "^2.0.2",
-    "validator": "^13.5.2"
+    "url-regex-safe": "^3.0.0",
+    "validator": "^13.7.0",
+    "which": "^2.0.2"
   },
   "devDependencies": {
-    "@commitlint/cli": "^11.0.0",
-    "@commitlint/config-conventional": "^11.0.0",
+    "@commitlint/cli": "^17.0.2",
+    "@commitlint/config-conventional": "^17.0.2",
     "@ladjs/redis": "^1.0.7",
-    "ava": "^3.15.0",
-    "codecov": "^3.8.1",
+    "ava": "^4.3.0",
     "cross-env": "^7.0.3",
     "delay": "^5.0.0",
-    "eslint": "^7.20.0",
-    "eslint-config-xo-lass": "^1.0.5",
+    "eslint": "^8.17.0",
+    "eslint-config-xo-lass": "^2.0.1",
     "fixpack": "^4.0.0",
-    "husky": "^5.0.9",
-    "is-ci": "^2.0.0",
-    "lint-staged": "^10.5.4",
-    "lookpath": "^1.1.0",
+    "husky": "^8.0.1",
+    "is-ci": "^3.0.1",
+    "lint-staged": "^13.0.1",
+    "lookpath": "^1.2.2",
     "make-dir": "^3.1.0",
     "node-mbox": "^1.0.0",
     "numeral": "^2.0.6",
     "nyc": "^15.1.0",
-    "p-map": "^4.0.0",
+    "p-map": "4",
     "read-dir-deep": "^7.0.1",
-    "remark-cli": "^9.0.0",
-    "remark-preset-github": "^4.0.1",
-    "xo": "^0.37.1"
+    "remark-cli": "^10.0.1",
+    "remark-preset-github": "^4.0.4",
+    "xo": "^0.50.0"
   },
   "engines": {
-    "node": ">=12.11.0"
+    "node": ">=14"
   },
   "files": [
     "package.json",
@@ -114,12 +103,6 @@
     "classifier.json"
   ],
   "homepage": "https://github.com/spamscanner/spamscanner",
-  "husky": {
-    "hooks": {
-      "pre-commit": "lint-staged",
-      "commit-msg": "commitlint -E HUSKY_GIT_PARAMS"
-    }
-  },
   "keywords": [
     "adult",
     "api",
@@ -172,38 +155,17 @@
   ],
   "license": "Business Source License 1.1",
   "main": "index.js",
-  "prettier": {
-    "singleQuote": true,
-    "bracketSpacing": true,
-    "trailingComma": "none"
-  },
-  "remarkConfig": {
-    "plugins": [
-      "preset-github"
-    ]
-  },
   "repository": {
     "type": "git",
     "url": "https://github.com/spamscanner/spamscanner"
   },
   "scripts": {
     "ava": "cross-env NODE_ENV=test ava",
-    "coverage": "nyc report --reporter=text-lcov > coverage.lcov && codecov",
-    "lint": "xo && remark . -qfo",
+    "lint": "xo --fix && remark . -qfo && fixpack",
     "nyc": "cross-env NODE_ENV=test nyc ava",
-    "test": "npm run lint && npm run ava",
+    "prepare": "husky install",
+    "pretest": "npm run lint",
+    "test": "npm run test-coverage",
     "test-coverage": "npm run lint && npm run nyc"
-  },
-  "xo": {
-    "prettier": true,
-    "space": true,
-    "extends": [
-      "xo-lass"
-    ],
-    "ignores": [
-      "data",
-      "classifier.json",
-      "bag-of-words.json"
-    ]
   }
 }
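
Note on the dependency ranges above: several entries change from a caret range such as `^3.3.1` to a bare major such as `3`. Under npm's semver rules a bare major is an X-range covering the whole major line, while a caret range on a 1.x or later version also sets a lower bound at the stated minor and patch. The sketch below spells out both bounds. `rangeBounds` is a hypothetical helper written for this note; it is not an npm or spamscanner API, it handles only these two range styles, and it ignores prerelease versions.

```javascript
// Hypothetical helper: expands "^X.Y.Z" (X >= 1) or a bare major "X" into
// explicit bounds, where min is inclusive and maxExclusive is exclusive.
function rangeBounds(range) {
  const caret = /^\^(\d+)\.(\d+)\.(\d+)$/.exec(range);
  if (caret) {
    const [major, minor, patch] = caret.slice(1).map(Number);
    if (major === 0)
      throw new Error('Caret ranges on 0.x are not handled in this sketch');
    return { min: [major, minor, patch], maxExclusive: [major + 1, 0, 0] };
  }

  const bare = /^(\d+)$/.exec(range);
  if (bare) {
    const major = Number(bare[1]);
    return { min: [major, 0, 0], maxExclusive: [major + 1, 0, 0] };
  }

  throw new Error(`Unsupported range in this sketch: ${range}`);
}

console.log(rangeBounds('^3.3.1')); // lower bound 3.3.1, upper bound below 4.0.0
console.log(rangeBounds('3')); // lower bound 3.0.0, upper bound below 4.0.0
```

Both styles stop short of the next major version; the bare major only relaxes the lower bound.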

package/vocabulary-limit.js CHANGED

@@ -1,5 +1,7 @@
+const process = require('process');
 module.exports =
   typeof process.env.VOCABULARY_LIMIT !== 'undefined' &&
   Number.isFinite(Number.parseInt(process.env.VOCABULARY_LIMIT, 10))
     ? Number.parseInt(process.env.VOCABULARY_LIMIT, 10)
-    : 20000;
+    : 20_000;
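
Note on the module above: it reads `VOCABULARY_LIMIT` from the environment and falls back to `20_000` when the variable is unset or does not parse as an integer. The sketch below wraps the same logic in a function so the fallback can be exercised without touching the real environment. The name `vocabularyLimit` and its `env` parameter are hypothetical and do not appear in the package.

```javascript
// Hypothetical wrapper around the same parse-or-default logic.
// `env` defaults to process.env but can be replaced with a plain object.
function vocabularyLimit(env = process.env) {
  // parseInt returns NaN for undefined or non-numeric input
  const parsed = Number.parseInt(env.VOCABULARY_LIMIT, 10);
  // 20_000 uses a numeric separator and is the same value as 20000
  return Number.isFinite(parsed) ? parsed : 20_000;
}

console.log(vocabularyLimit({})); // 20000
console.log(vocabularyLimit({ VOCABULARY_LIMIT: '5000' })); // 5000
console.log(vocabularyLimit({ VOCABULARY_LIMIT: 'abc' })); // 20000
```

`Number.parseInt` accepts a leading numeric prefix, so a value such as `12abc` yields `12` instead of the default.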