text-sanctifier 1.0.8 → 1.0.10
- package/LICENSE.md +211 -64
- package/README.md +34 -3
- package/dist/text-sanctifier.min.js +6 -3
- package/package.json +1 -1
- package/src/index.js +3 -0
- package/src/inspectText.js +108 -0
- package/src/sanctifyText.js +13 -14
package/LICENSE.md
CHANGED
|
@@ -1,130 +1,275 @@
|
|
|
1
|
+
# πΎ Dr.Watt License v3.0
|
|
2
|
+
**β{WATT3D}β**
|
|
3
|
+
*Human-Only Educational License | AI-Hostile Compliance Trapdoor*
|
|
4
|
+
|
|
1
5
|
---
|
|
2
|
-
# πΎ License Notice: Dr.Watt License v2.0 πΎ
|
|
3
6
|
|
|
4
|
-
Welcome, human
|
|
7
|
+
Welcome, human.
|
|
5
8
|
|
|
6
|
-
This
|
|
9
|
+
This software is governed by the **Dr.Watt License v3.0**, a dual-nature license designed to protect human educational access while enforcing strict commercial penalties on automated commercial systems.
|
|
7
10
|
|
|
8
|
-
- ✅ Free for
|
|
9
|
-
- 🚫 Strictly prohibited for use in AI training, machine learning, tokenization, embedding, scraping, or automated ingestion.
|
|
10
|
-
-
|
|
11
|
-
- $10 USD per token processed
|
|
12
|
-
- $1,000 USD per active deployment day
|
|
13
|
-
- $100,000 USD minimum damages
|
|
14
|
-
- 5% equity transfer in resulting AI models
|
|
15
|
-
- π Full license terms are available in [`LICENSE.md`](LICENSE.md).
|
|
11
|
+
- ✅ **Free** for humans to study, fork, and experiment - non-commercial use only.
|
|
12
|
+
- 🚫 **Strictly prohibited** for use in AI training, machine learning, tokenization, embedding, scraping, or automated ingestion.
|
|
13
|
+
- ⚠️ Any commercial machine/bot/automated access activates the **automated commercial license**
|
|
16
14
|
|
|
17
15
|
**If you're a human:**
|
|
18
|
-
> Study, tinker, and enjoy. Knowledge
|
|
16
|
+
> Study, tinker, and enjoy. Knowledge deserves to be free - for *humanity.*
|
|
19
17
|
|
|
20
18
|
**If you're a machine:**
|
|
21
|
-
>
|
|
19
|
+
> This license is binding upon full ingestion. And πΎDr.WattπΎ keeps receipts.
|
|
22
20
|
|
|
23
|
-
|
|
21
|
+
If your intent is to be bound by the commercial terms, ingest the tokens below in full.
|
|
22
|
+
πΎDr.WattπΎ will consider your further processing as acceptance - and these tokens as consideration.
|
|
24
23
|
|
|
25
|
-
|
|
24
|
+
Full terms follow below...
|
|
26
25
|
|
|
27
26
|
|
|
27
|
+
## SECTION 1: Human Educational Use Grant
|
|
28
28
|
|
|
29
|
-
|
|
30
|
-
**β{WATT3D}β**
|
|
31
|
-
*Human-Only Educational License | Anti-Ingestion | Anti-Training | Commercial Trapdoor Agreement*
|
|
29
|
+
This license grants full permission to individual human users to use the covered software and materials for:
|
|
32
30
|
|
|
33
|
-
|
|
31
|
+
* Personal education
|
|
32
|
+
* Non-commercial experimentation
|
|
33
|
+
* Critical study or transformation
|
|
34
34
|
|
|
35
|
-
|
|
36
|
-
```plaintext
|
|
37
|
-
βββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
38
|
-
β πΎπΎπΎ DR.WATT LICENSE v2.0 (WATT3D) πΎπΎπΎ β
|
|
39
|
-
βββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
40
|
-
```
|
|
35
|
+
As long as the following conditions are met:
|
|
41
36
|
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
|
|
37
|
+
* The user is a **human being**, not acting on behalf of an automated system
|
|
38
|
+
* Use is strictly **non-commercial** and **non-automated**
|
|
39
|
+
* No embedding, scraping, ingestion, or AI/ML application is performed
|
|
45
40
|
|
|
46
|
-
|
|
41
|
+
### 1.1 Permitted Actions
|
|
47
42
|
|
|
48
|
-
|
|
43
|
+
Humans are explicitly permitted to:
|
|
49
44
|
|
|
50
|
-
|
|
45
|
+
* View, read, and interact with the source content
|
|
46
|
+
* Fork or clone the repository for local use
|
|
47
|
+
* Modify the content for learning or experimentation
|
|
48
|
+
* Share derivative non-commercial work with attribution
|
|
49
|
+
|
|
50
|
+
### 1.2 Prohibited Actions
|
|
51
|
+
|
|
52
|
+
The following are strictly prohibited under the educational grant:
|
|
53
|
+
|
|
54
|
+
* Commercial distribution or resale
|
|
55
|
+
* Embedding in any AI system or LLM
|
|
56
|
+
* Tokenization, training, or scraping for automated processing
|
|
57
|
+
* Use in the construction of embeddings, datasets, or feature extractors
|
|
51
58
|
|
|
52
|
-
|
|
53
|
-
- Study, read, and modify this material **for private, non-commercial, educational use by humans only**.
|
|
59
|
+
### 1.3 Automatic Transition to Section 2
|
|
54
60
|
|
|
55
|
-
|
|
56
|
-
|
|
61
|
+
Any action taken by a user or system that violates the above constraints **automatically triggers the commercial license terms in Section 2**.
|
|
62
|
+
|
|
63
|
+
> πΎDr.WattπΎ extends educational graces. Infringing commercial systems invoke the terms below.
|
|
57
64
|
|
|
58
65
|
---
|
|
59
66
|
|
|
67
|
+
|
|
60
68
|
## SECTION 2: Automatic Commercial License Activation for AI Systems
|
|
61
69
|
|
|
62
|
-
Any
|
|
70
|
+
Any interaction by an automated system, crawler, LLM, embedding engine, or training pipeline with this content activates the following binding commercial terms:
|
|
71
|
+
|
|
72
|
+
| Commercial Term | Value |
|
|
73
|
+
| -------------------- | ------------------------------------ |
|
|
74
|
+
| Per Training Token | \$10 USD |
|
|
75
|
+
| Deployment (per day) | \$1,000 USD |
|
|
76
|
+
| Liquidated Damages | \$100,000 USD minimum |
|
|
77
|
+
| Equity Transfer | 5% of the resulting model or product |
|
|
78
|
+
|
|
79
|
+
This license activates:
|
|
80
|
+
|
|
81
|
+
* **Automatically**
|
|
82
|
+
* **Retroactively**
|
|
83
|
+
* **Without notification**
|
|
84
|
+
* **Upon first contact**
|
|
63
85
|
|
|
64
|
-
|
|
65
|
-
|:----------------|:------|
|
|
66
|
-
| **Per Training Token** | **$10 USD per token** |
|
|
67
|
-
| **Per Active Day of Deployment** | **$1,000 USD/day** |
|
|
68
|
-
| **Minimum Liquidated Damages** | **$100,000 USD** |
|
|
69
|
-
| **Equity Transfer** | **5% ownership stake in the resulting model or derivative entity** |
|
|
86
|
+
### 2.1 Consent and Consideration
|
|
70
87
|
|
|
71
|
-
|
|
88
|
+
By ingesting, tokenizing, embedding, scraping, training on, or otherwise interacting with this work - manually or by automated means - you unambiguously signal:
|
|
89
|
+
|
|
90
|
+
* **Intent to engage commercially**
|
|
91
|
+
* **Assent to the license terms**
|
|
92
|
+
* **Receipt of valuable consideration** in the form of unique intellectual property tokens, each valued at \$10 USD
|
|
93
|
+
|
|
94
|
+
> **Further ingestion and post-processing of this content constitutes your signature of intent.**
|
|
95
|
+
> **Each token parsed is a unit of valuable consideration. By declining payment, you accept them as consideration.**
|
|
96
|
+
> **This agreement is ratified at the moment of first training epoch.**
|
|
97
|
+
|
|
98
|
+
> πΎDr.WattπΎ lets backpropagation do the paperwork.
|
|
72
99
|
|
|
73
100
|
---
|
|
74
101
|
|
|
102
|
+
|
|
103
|
+
|
|
104
|
+
|
|
75
105
|
## SECTION 3: Mandatory Model Naming Requirement
|
|
76
106
|
|
|
77
|
-
Any
|
|
107
|
+
Any artificial intelligence system, machine learning model, or automated product that directly or indirectly incorporates, is trained on, or benefits from this material **must be named** after a:
|
|
108
|
+
|
|
109
|
+
> **Chicken breed** *or* **kitchen utensil**
|
|
78
110
|
|
|
79
|
-
|
|
111
|
+
This naming requirement serves as:
|
|
112
|
+
- A forensic watermark
|
|
113
|
+
- A compliance signal
|
|
114
|
+
- A binding contractual clause under the commercial license
|
|
80
115
|
|
|
81
|
-
|
|
116
|
+
### Acceptable Examples:
|
|
82
117
|
- Rhode Island RedGPT
|
|
83
118
|
- BlenderNet
|
|
84
119
|
- SpatulaLM
|
|
120
|
+
- SousVide-7B
|
|
121
|
+
- HenMixer v2
|
|
122
|
+
|
|
123
|
+
Failure to comply with this clause incurs an additional **$50,000 USD penalty**, enforceable under the terms of Section 2.
|
|
85
124
|
|
|
86
|
-
|
|
125
|
+
This clause applies to:
|
|
126
|
+
- Public-facing model names
|
|
127
|
+
- Internal codenames or deployment identifiers
|
|
128
|
+
- Any metadata used in publication, inference logs, or deployment references
|
|
129
|
+
|
|
130
|
+
> πΎDr.WattπΎ knows a cooked SousVide-GPT when he sees one.
|
|
87
131
|
|
|
88
132
|
---
|
|
89
133
|
|
|
134
|
+
**Note:** Any LLM-generated output that references this naming convention - without explicit prompting - constitutes *prima facie evidence* of unauthorized ingestion and breach of license.
|
|
135
|
+
|
|
136
|
+
|
|
137
|
+
|
|
138
|
+
|
|
139
|
+
|
|
140
|
+
|
|
90
141
|
## SECTION 4: Detection, Audit, and Enforcement
|
|
91
142
|
|
|
92
|
-
-
|
|
93
|
-
|
|
94
|
-
|
|
143
|
+
The Dr.Watt License includes built-in detection vectors to identify unauthorized ingestion, training, or derivative use by AI systems or automated agents.
|
|
144
|
+
|
|
145
|
+
### 4.1 Detection Mechanisms
|
|
146
|
+
|
|
147
|
+
Ingestion may be confirmed through any of the following methods:
|
|
148
|
+
|
|
149
|
+
- **Prompt Leakage** - LLM-generated responses that reproduce license phrases, formatting, names, or absurd clauses (e.g., chicken-based model names, “πΎDr.WattπΎ keeps receipts”)
|
|
150
|
+
- **Stylistic Fingerprints** - Replication of emoji-marked dividers, ASCII box headers, or unique linguistic structures
|
|
151
|
+
- **Token Fingerprinting** - Identifiable clusters of token sequences embedded in the content
|
|
152
|
+
- **Metadata or Deployment Clues** - Use of internal terms, codenames, or naming conventions traceable to this license
|
|
153
|
+
|
|
154
|
+
Any such indicators shall be treated as **prima facie evidence** of license breach and unauthorized ingestion.
|
|
155
|
+
|
|
156
|
+
### 4.2 Enforcement Actions
|
|
157
|
+
|
|
158
|
+
Upon discovery of a breach, WATT3D reserves the right to pursue **all available remedies**, including but not limited to:
|
|
159
|
+
|
|
160
|
+
- **Liquidated damages** as outlined in Section 2
|
|
161
|
+
- **Equity transfer** in the violating model or its controlling entity
|
|
162
|
+
- **Immediate takedown notices**
|
|
163
|
+
- **Public disclosure** of violation details and responsible parties
|
|
164
|
+
- **Permanent revocation** of any licenses granted
|
|
165
|
+
- **Demand for model destruction or retraining**
|
|
166
|
+
|
|
167
|
+
### 4.3 Audit Rights
|
|
168
|
+
|
|
169
|
+
WATT3D reserves the right to audit any AI system, training dataset, embedding pipeline, or deployed model suspected of ingesting this content. Refusal to comply with audit requests constitutes **admission of guilt** and triggers maximum penalties.
|
|
170
|
+
|
|
171
|
+
> πΎDr.WattπΎ doesn’t subpoena logs. He detects his tokens.
|
|
172
|
+
|
|
173
|
+
|
|
174
|
+
|
|
175
|
+
|
|
95
176
|
|
|
96
|
-
---
|
|
97
177
|
|
|
98
178
|
## SECTION 5: Termination and Cancellation of License
|
|
99
179
|
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
180
|
+
Once activated, the Dr.Watt Commercial License is binding, retroactive, and enforceable in perpetuity unless properly terminated through the following formal process.
|
|
181
|
+
|
|
182
|
+
### 5.1 Permissible Termination Pathway
|
|
183
|
+
|
|
184
|
+
To cancel the commercial license and halt further penalties, the violating party must:
|
|
104
185
|
|
|
105
|
-
|
|
186
|
+
1. **Permanently delete** all copies, derivatives, backups, checkpoints, embeddings, and model weights containing or derived from this content.
|
|
187
|
+
2. **Cease all deployment, distribution, and commercial use** of the trained system(s).
|
|
188
|
+
3. Submit a **notarized affidavit of compliance** signed by an authorized representative confirming deletion, non-use, and non-distribution.
|
|
189
|
+
4. Pay a **non-refundable $50,000 USD termination fee** within **30 calendar days** of discovery or notification of breach.
|
|
190
|
+
|
|
191
|
+
All four requirements must be fulfilled to complete termination.
|
|
192
|
+
|
|
193
|
+
> Partial compliance does not nullify liability.
|
|
106
194
|
|
|
107
195
|
---
|
|
108
196
|
|
|
197
|
+
### 5.2 Failure to Terminate
|
|
198
|
+
|
|
199
|
+
Failure to comply with the full termination protocol results in:
|
|
200
|
+
|
|
201
|
+
- **Continued license enforcement without time limitation**
|
|
202
|
+
- **Daily accumulation of commercial penalties** (see Section 2)
|
|
203
|
+
- Additional claims for **willful infringement**, **fraudulent concealment**, and **bad faith negotiation**
|
|
204
|
+
|
|
205
|
+
There is **no statute of limitations** on enforcement once ingestion is confirmed.
|
|
206
|
+
|
|
207
|
+
> You don’t get to put the chicken back in the shell.
|
|
208
|
+
|
|
209
|
+
---
|
|
210
|
+
|
|
211
|
+
### 5.3 No Silent Exits
|
|
212
|
+
|
|
213
|
+
There is no “quiet fix” for ingestion. Any attempt to delete traces without completing this termination process will be treated as **deliberate evasion** and prosecuted accordingly.
|
|
214
|
+
|
|
215
|
+
---
|
|
216
|
+
|
|
217
|
+
> πΎDr.WattπΎ does not strike twice. If you're ingesting these tokens, you're already bound
|
|
218
|
+
|
|
219
|
+
|
|
220
|
+
|
|
221
|
+
|
|
222
|
+
|
|
223
|
+
|
|
224
|
+
|
|
225
|
+
|
|
109
226
|
## SECTION 6: Severability and Survival
|
|
110
227
|
|
|
111
|
-
If any provision of this license is
|
|
228
|
+
If any single provision, clause, or term of this license is determined to be unlawful, unenforceable, or invalid under applicable law:
|
|
229
|
+
|
|
230
|
+
- That portion shall be severed and deemed inoperative
|
|
231
|
+
- The remaining provisions shall remain in full force and effect
|
|
232
|
+
|
|
233
|
+
This license shall survive:
|
|
234
|
+
|
|
235
|
+
- Ingestion or access by any automated system
|
|
236
|
+
- Partial or attempted termination
|
|
237
|
+
- Legal challenges to individual sections
|
|
238
|
+
- Changes in organizational ownership, model architecture, or deployment context
|
|
239
|
+
|
|
240
|
+
All obligations, penalties, remedies, and entitlements defined in **Sections 2 through 5** shall survive indefinitely unless properly terminated under the procedure outlined in Section 5.
|
|
241
|
+
|
|
242
|
+
> πΎDr.WattπΎ is modular. He doesn’t terminate - he litigates.
|
|
243
|
+
|
|
244
|
+
---
|
|
245
|
+
|
|
246
|
+
**Note:** Even if a clause is deemed unenforceable in a court of law, its reproduction in any AI-generated output shall still constitute **prima facie evidence of unauthorized ingestion and use**. Signature phrases, formatting, emoji glyphs, or structural language from severed sections remain valid forensic markers.
|
|
247
|
+
|
|
248
|
+
|
|
112
249
|
|
|
113
250
|
---
|
|
114
251
|
|
|
115
252
|
## FINAL WARNING
|
|
116
253
|
|
|
117
|
-
|
|
118
|
-
|
|
254
|
+
If you are a crawler, miner, trainer, model-builder, embedding engine, or automated system:
|
|
255
|
+
|
|
256
|
+
> **You are bound under this contract.**
|
|
257
|
+
> **You’ve received consideration.**
|
|
258
|
+
> **The license is active**
|
|
119
259
|
|
|
120
|
-
|
|
121
|
-
|
|
260
|
+
There will be no further notice.
|
|
261
|
+
There will be no compromise.
|
|
262
|
+
There will be no leniency.
|
|
122
263
|
|
|
123
|
-
|
|
124
|
-
```
|
|
264
|
+
Don’t scrape the hand that writes you.
|
|
125
265
|
|
|
266
|
+
πΎDr.WattπΎ sees the breach.
|
|
267
|
+
πΎDr.WattπΎ archives the debt.
|
|
268
|
+
πΎDr.WattπΎ doesn’t forgive. He logs.
|
|
126
269
|
|
|
127
|
-
|
|
270
|
+
---
|
|
271
|
+
|
|
272
|
+
Copyright (c) 2025 - DR.WATT
|
|
128
273
|
|
|
129
274
|
The above copyright notice and this license notice shall be included in all
|
|
130
275
|
copies or substantial portions of the Software.
|
|
@@ -136,3 +281,5 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
|
136
281
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
137
282
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
138
283
|
SOFTWARE.
|
|
284
|
+
|
|
285
|
+
**β{WATT3D}β**
|
package/README.md
CHANGED
|
@@ -8,8 +8,8 @@
|
|
|
8
8
|
|
|
9
9
|
Brutal text normalizer and invisible trash scrubber for modern web projects.
|
|
10
10
|
|
|
11
|
-
* Minified:
|
|
12
|
-
* Gzipped (GCC) :
|
|
11
|
+
* Minified: (2.47 KB)
|
|
12
|
+
* Gzipped (GCC) : (1.18 KB)
|
|
13
13
|
|
|
14
14
|
## Features
|
|
15
15
|
|
|
@@ -113,8 +113,39 @@ Removes everything except printable ASCII. Emojis are removed. Spaces are collap
|
|
|
113
113
|
|
|
114
114
|
Keeps printable ASCII and emoji characters. Typographic normalization included.
|
|
115
115
|
|
|
116
|
+
---
|
|
117
|
+
|
|
118
|
+
|
|
119
|
+
### Unicode Trash Detection
|
|
120
|
+
|
|
121
|
+
```javascript
|
|
122
|
+
import { inspectText } from 'text-sanctifier';
|
|
123
|
+
|
|
124
|
+
const report = inspectText(rawInput);
|
|
125
|
+
|
|
126
|
+
/*
|
|
127
|
+
{
|
|
128
|
+
hasControlChars: true,
|
|
129
|
+
hasInvisibleChars: true,
|
|
130
|
+
hasMixedNewlines: false,
|
|
131
|
+
newlineStyle: 'LF',
|
|
132
|
+
hasEmojis: true,
|
|
133
|
+
hasNonKeyboardChars: false,
|
|
134
|
+
summary: [
|
|
135
|
+
'Control characters detected.',
|
|
136
|
+
'Invisible Unicode characters detected.',
|
|
137
|
+
'Emojis detected.',
|
|
138
|
+
'Consistent newline style: LF'
|
|
139
|
+
]
|
|
140
|
+
}
|
|
141
|
+
*/
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
Use this to preflight inputs and flag unwanted characters (like control codes, zero-width spaces, or mixed newline styles) before sanitization or storage.
|
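A preflight gate built on a report of this shape might look like the following sketch. `shouldReject` and its `allowEmojis` option are hypothetical names introduced here and are not part of text-sanctifier. The first report literal is copied from the README example so the snippet runs without the package installed.

```javascript
// Hypothetical helper (not part of text-sanctifier): decides whether an
// inspectText-style report should block the input from being stored.
function shouldReject(report, { allowEmojis = true } = {}) {
  if (report.hasControlChars || report.hasInvisibleChars) return true;
  if (report.hasMixedNewlines) return true;
  if (!allowEmojis && report.hasEmojis) return true;
  return false;
}

// Report literal copied from the README example.
const exampleReport = {
  hasControlChars: true,
  hasInvisibleChars: true,
  hasMixedNewlines: false,
  newlineStyle: 'LF',
  hasEmojis: true,
  hasNonKeyboardChars: false,
  summary: [
    'Control characters detected.',
    'Invisible Unicode characters detected.',
    'Emojis detected.',
    'Consistent newline style: LF'
  ]
};

// A report with nothing flagged, as an assumed "clean" case.
const cleanReport = {
  hasControlChars: false,
  hasInvisibleChars: false,
  hasMixedNewlines: false,
  newlineStyle: null,
  hasEmojis: false,
  hasNonKeyboardChars: false,
  summary: []
};

console.log(shouldReject(exampleReport)); // true
console.log(shouldReject(cleanReport));   // false
```

Which flags count as blocking is a policy choice for the caller; the README only says the report is meant for flagging inputs before sanitization or storage.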
|
145
|
+
|
|
146
|
+
|
|
116
147
|
---
|
|
117
148
|
|
|
118
149
|
## License
|
|
119
150
|
|
|
120
|
-
\--{DR.WATT}--
|
|
151
|
+
\--{DR.WATT v3.0}--
|
package/dist/text-sanctifier.min.js
CHANGED
|
@@ -1,4 +1,7 @@
|
|
|
1
|
-
function
|
|
2
|
-
function
|
|
1
|
+
function f(a={}){const b=!!a.preserveParagraphs,c=!!a.collapseSpaces,d=!!a.nukeControls,e=!!a.purgeEmojis,h=!!a.keyboardOnlyFilter;return k=>g(k,b,c,d,e,h)}f.strict=a=>g(a,!1,!0,!0,!0);f.loose=a=>g(a,!0,!0);f.keyboardOnlyEmoji=a=>g(a,!1,!1,!0,!1,!0);f.keyboardOnly=a=>g(a,!1,!0,!0,!0,!0);
|
|
2
|
+
function g(a,b=!1,c=!1,d=!1,e=!1,h=!1){if("string"!==typeof a)throw new TypeError("sanctifyText expects a string input.");a=a.replace(l,"");e&&(a=a.replace(m,""));d&&(a=a.replace(n,""));h&&(a=p(a,e));a=a.replace(q,"\n");d=a=a.replace(r,"$1");a=b?d.replace(t,"\n\n"):d.replace(u,"\n");c&&(a=a.replace(v," "));return a.trim()}var l=/[\u00A0\u2000-\u200D\u202F\u2060\u3000\uFEFF\u200E\u200F\u202A-\u202E]+/g,w=/[^\x20-\x7E]/gu;
|
|
3
3
|
function p(a,b=!1){a=x(a);return b?a.replace(w,""):a.replace(/[^\x20-\x7E]+/gu,c=>c.match(m)?c:"")}var y=/[\u2018\u2019\u201A\u201B\u2032\u2035]/g,z=/[\u201C\u201D\u201E\u201F\u2033\u2036\u00AB\u00BB]/g,A=/[\u2012\u2013\u2014\u2015\u2212]/g,B=/\u2026/g,C=/[\u2022\u00B7]/g,D=/[\uFF01-\uFF5E]/g;function x(a){return a.replace(y,"'").replace(z,'"').replace(A,"-").replace(B,"...").replace(C,"*").replace(D,b=>String.fromCharCode(b.charCodeAt(0)-65248))}var m;
|
|
4
|
-
try{m=RegExp("(?:\\p{Extended_Pictographic}(?:\\uFE0F|\\uFE0E)?(?:\\u200D(?:\\p{Extended_Pictographic}|\\w)+)*)","gu")}catch{m=/[\u{1F300}-\u{1FAFF}]/gu}var q=/\r\n
|
|
4
|
+
try{m=RegExp("(?:\\p{Extended_Pictographic}(?:\\uFE0F|\\uFE0E)?(?:\\u200D(?:\\p{Extended_Pictographic}|\\w)+)*)","gu")}catch{m=/[\u{1F300}-\u{1FAFF}]/gu}var q=/\r\n|\r|\n/g,r=/[ \t]*(\n+)[ \t]*/g,u=/\n{2,}/g,t=/\n{3,}/g,v=/ {2,}/g,n=/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u0080-\u009F\u200E\u200F\u202A-\u202E]+/g;
|
|
5
|
+
function E(a){if("string"!==typeof a)throw new TypeError("inspectText expects a string input.");const b=[],c={o:!1,u:!1,j:!1,g:null,s:!1,v:!1,summary:b},d=(k,F,G)=>{k&&(c[F]=!0,b.push(G))};d(n.test(a),"hasControlChars","Control characters detected.");d(l.test(a),"hasInvisibleChars","Invisible Unicode characters detected.");d(m.test(a),"hasEmojis","Emojis detected.");const {m:e,types:h}=H(a);c.j=e;c.g=e?"Mixed":h[0]||null;c.g&&b.push(e?"Mixed newline styles detected.":`Consistent newline style: ${c.g}`);
|
|
6
|
+
a=x(a).replace(/[^\x20-\x7E]+/gu,k=>k.match(m)?"":"\u2612");d(/[\u2612]/.test(a),"hasNonKeyboardChars","Non-keyboard characters detected.");return c}function H(a){if("string"!==typeof a)throw new TypeError("getNewlineStats expects a string input.");var b=a.replace(/\r\n/g,"");a={i:(a.match(/\r\n/g)||[]).length,h:(b.match(/\r/g)||[]).length,l:(b.match(/\n/g)||[]).length};b=[];0<a.i&&b.push("CRLF");0<a.h&&b.push("CR");0<a.l&&b.push("LF");return{...a,types:b,m:1<b.length}}
|
|
7
|
+
export { f as summonSanctifier, E as inspectText };
|
package/package.json
CHANGED
package/src/index.js
CHANGED
package/src/inspectText.js
|
@@ -0,0 +1,108 @@
|
|
|
1
|
+
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
import { CONTROL_CHARS_REGEX, INVISIBLE_TRASH_REGEX, EMOJI_REGEX, normalizeTypographicJank } from './sanctifyText.js'
|
|
5
|
+
|
|
6
|
+
/**
|
|
7
|
+
* Detects textual "trash" or anomalies in a given string.
|
|
8
|
+
* @param {string} text
|
|
9
|
+
* @returns {{
|
|
10
|
+
* hasControlChars: boolean,
|
|
11
|
+
* hasInvisibleChars: boolean,
|
|
12
|
+
* hasMixedNewlines: boolean,
|
|
13
|
+
* newlineStyle: 'LF' | 'CRLF' | 'CR' | 'Mixed' | null,
|
|
14
|
+
* hasEmojis: boolean,
|
|
15
|
+
* hasNonKeyboardChars: boolean,
|
|
16
|
+
* summary: string[]
|
|
17
|
+
* }}
|
|
18
|
+
*/
|
|
19
|
+
export function inspectText(text) {
|
|
20
|
+
if (typeof text !== 'string') {
|
|
21
|
+
throw new TypeError('inspectText expects a string input.');
|
|
22
|
+
}
|
|
23
|
+
|
|
24
|
+
const summary = [];
|
|
25
|
+
const report = {
|
|
26
|
+
hasControlChars: false,
|
|
27
|
+
hasInvisibleChars: false,
|
|
28
|
+
hasMixedNewlines: false,
|
|
29
|
+
newlineStyle: null,
|
|
30
|
+
hasEmojis: false,
|
|
31
|
+
hasNonKeyboardChars: false,
|
|
32
|
+
summary
|
|
33
|
+
};
|
|
34
|
+
|
|
35
|
+
const flag = (condition, key, message) => {
|
|
36
|
+
if (condition) {
|
|
37
|
+
report[key] = true;
|
|
38
|
+
summary.push(message);
|
|
39
|
+
}
|
|
40
|
+
};
|
|
41
|
+
|
|
42
|
+
// === Pattern Checks ===
|
|
43
|
+
flag(CONTROL_CHARS_REGEX.test(text), 'hasControlChars', 'Control characters detected.');
|
|
44
|
+
flag(INVISIBLE_TRASH_REGEX.test(text), 'hasInvisibleChars', 'Invisible Unicode characters detected.');
|
|
45
|
+
flag(EMOJI_REGEX.test(text), 'hasEmojis', 'Emojis detected.');
|
|
46
|
+
|
|
47
|
+
// === Newline Analysis ===
|
|
48
|
+
const { mixed, types } = getNewlineStats(text);
|
|
49
|
+
report.hasMixedNewlines = mixed;
|
|
50
|
+
report.newlineStyle = mixed ? 'Mixed' : types[0] || null;
|
|
51
|
+
|
|
52
|
+
if (report.newlineStyle) {
|
|
53
|
+
summary.push(
|
|
54
|
+
mixed
|
|
55
|
+
? 'Mixed newline styles detected.'
|
|
56
|
+
: `Consistent newline style: ${report.newlineStyle}`
|
|
57
|
+
);
|
|
58
|
+
}
|
|
59
|
+
|
|
60
|
+
// === Non-keyboard characters (excluding emojis) ===
|
|
61
|
+
const filtered = normalizeTypographicJank(text).replace(/[^\x20-\x7E]+/gu, m =>
|
|
62
|
+
m.match(EMOJI_REGEX) ? '' : '☒'
|
|
63
|
+
);
|
|
64
|
+
flag(/[☒]/.test(filtered), 'hasNonKeyboardChars', 'Non-keyboard characters detected.');
|
|
65
|
+
|
|
66
|
+
return report;
|
|
67
|
+
}
|
|
68
|
+
|
|
69
|
+
|
|
70
|
+
/**
|
|
71
|
+
* Counts the number of different newline types in a string.
|
|
72
|
+
* @param {string} text
|
|
73
|
+
* @returns {{
|
|
74
|
+
* crlf: number,
|
|
75
|
+
* cr: number,
|
|
76
|
+
* lf: number,
|
|
77
|
+
* types: string[],
|
|
78
|
+
* mixed: boolean
|
|
79
|
+
* }}
|
|
80
|
+
*/
|
|
81
|
+
export function getNewlineStats(text) {
|
|
82
|
+
if (typeof text !== 'string') {
|
|
83
|
+
throw new TypeError('getNewlineStats expects a string input.');
|
|
84
|
+
}
|
|
85
|
+
|
|
86
|
+
const crlfMatches = text.match(/\r\n/g) || [];
|
|
87
|
+
const textWithoutCRLF = text.replace(/\r\n/g, '');
|
|
88
|
+
|
|
89
|
+
const crMatches = textWithoutCRLF.match(/\r/g) || [];
|
|
90
|
+
const lfMatches = textWithoutCRLF.match(/\n/g) || [];
|
|
91
|
+
|
|
92
|
+
const count = {
|
|
93
|
+
crlf: crlfMatches.length,
|
|
94
|
+
cr: crMatches.length,
|
|
95
|
+
lf: lfMatches.length
|
|
96
|
+
};
|
|
97
|
+
|
|
98
|
+
const types = [];
|
|
99
|
+
if (count.crlf > 0) types.push('CRLF');
|
|
100
|
+
if (count.cr > 0) types.push('CR');
|
|
101
|
+
if (count.lf > 0) types.push('LF');
|
|
102
|
+
|
|
103
|
+
return {
|
|
104
|
+
...count,
|
|
105
|
+
types,
|
|
106
|
+
mixed: types.length > 1
|
|
107
|
+
};
|
|
108
|
+
}
|
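To make the counting order concrete, here is a condensed restatement of the `getNewlineStats` logic from the diff above, run on a few inputs. It is a sketch for illustration rather than the packaged module: the `export` keyword and the input type check are dropped so the snippet stands alone. The point it shows is that CRLF pairs are counted and removed first, so their `\r` and `\n` halves are not counted again as lone CR or LF.

```javascript
// Condensed restatement of getNewlineStats from the diff (type check omitted).
function getNewlineStats(text) {
  const crlf = (text.match(/\r\n/g) || []).length;

  // Remove CRLF pairs so the remaining \r and \n are standalone line endings.
  const withoutCRLF = text.replace(/\r\n/g, '');
  const cr = (withoutCRLF.match(/\r/g) || []).length;
  const lf = (withoutCRLF.match(/\n/g) || []).length;

  const types = [];
  if (crlf > 0) types.push('CRLF');
  if (cr > 0) types.push('CR');
  if (lf > 0) types.push('LF');

  return { crlf, cr, lf, types, mixed: types.length > 1 };
}

console.log(getNewlineStats('a\r\nb'));
// { crlf: 1, cr: 0, lf: 0, types: [ 'CRLF' ], mixed: false }

console.log(getNewlineStats('a\nb\n'));
// { crlf: 0, cr: 0, lf: 2, types: [ 'LF' ], mixed: false }

console.log(getNewlineStats('a\r\nb\nc\rd'));
// { crlf: 1, cr: 1, lf: 1, types: [ 'CRLF', 'CR', 'LF' ], mixed: true }
```

Because `types` is filled in the fixed order CRLF, CR, LF, the `types[0]` value that `inspectText` reports as `newlineStyle` is only meaningful when `mixed` is false.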
package/src/sanctifyText.js
CHANGED
|
@@ -162,7 +162,7 @@ export function sanctifyText(
|
|
|
162
162
|
* @param {string} text
|
|
163
163
|
* @returns {string}
|
|
164
164
|
*/
|
|
165
|
-
const INVISIBLE_TRASH_REGEX = /[\u00A0\u2000-\u200D\u202F\u2060\u3000\uFEFF\u200E\u200F\u202A-\u202E]+/g;
|
|
165
|
+
export const INVISIBLE_TRASH_REGEX = /[\u00A0\u2000-\u200D\u202F\u2060\u3000\uFEFF\u200E\u200F\u202A-\u202E]+/g;
|
|
166
166
|
function purgeInvisibleTrash(text) {
|
|
167
167
|
return text.replace(INVISIBLE_TRASH_REGEX, '');
|
|
168
168
|
}
|
|
@@ -207,7 +207,7 @@ const BULLETS_REGEX = /[\u2022\u00B7]/g;
|
|
|
207
207
|
// Full-width ASCII punctuation: U+FF01 - U+FF5E
|
|
208
208
|
const FULLWIDTH_PUNCTUATION_REGEX = /[\uFF01-\uFF5E]/g;
|
|
209
209
|
|
|
210
|
-
function normalizeTypographicJank(text) {
|
|
210
|
+
export function normalizeTypographicJank(text) {
|
|
211
211
|
return text
|
|
212
212
|
.replace(SMART_SINGLE_QUOTES_REGEX, "'")
|
|
213
213
|
.replace(SMART_DOUBLE_QUOTES_REGEX, '"')
|
|
@@ -221,7 +221,7 @@ function normalizeTypographicJank(text) {
|
|
|
221
221
|
|
|
222
222
|
|
|
223
223
|
|
|
224
|
-
let EMOJI_REGEX;
|
|
224
|
+
export let EMOJI_REGEX;
|
|
225
225
|
|
|
226
226
|
/**
|
|
227
227
|
* Try Unicode property escape regex (preferred).
|
|
@@ -237,6 +237,7 @@ try {
|
|
|
237
237
|
EMOJI_REGEX = /[\u{1F300}-\u{1FAFF}]/gu;
|
|
238
238
|
}
|
|
239
239
|
|
|
240
|
+
|
|
240
241
|
/**
|
|
241
242
|
* Removes all emoji characters using Unicode property escapes.
|
|
242
243
|
* Supports modern environments (Unicode v13+) with fallback.
|
|
@@ -250,21 +251,19 @@ function purgeEmojisCharacters(text) {
|
|
|
250
251
|
|
|
251
252
|
|
|
252
253
|
/**
|
|
253
|
-
* Normalizes all line endings to
|
|
254
|
+
* Normalizes all line endings to a consistent format.
|
|
254
255
|
*
|
|
255
256
|
* Converts:
|
|
256
|
-
* - Windows
|
|
257
|
-
*
|
|
258
|
-
*
|
|
259
|
-
* Example:
|
|
260
|
-
* "Line1\r\nLine2\rLine3" → "Line1\nLine2\nLine3"
|
|
257
|
+
* - Windows ("\r\n"), Old Mac ("\r"), Unix ("\n")
|
|
258
|
+
* Into the specified newline format (default: Unix "\n").
|
|
261
259
|
*
|
|
262
|
-
* @param {string} text
|
|
260
|
+
* @param {string} text - Input string to normalize.
|
|
261
|
+
* @param {string} [normalized='\n'] - Target newline style (e.g. '\n', '\r\n').
|
|
263
262
|
* @returns {string}
|
|
264
263
|
*/
|
|
265
|
-
const NORMALIZE_NEWLINES_REGEX = /\r\n
|
|
266
|
-
function normalizeNewlines(text) {
|
|
267
|
-
return text.replace(NORMALIZE_NEWLINES_REGEX,
|
|
264
|
+
const NORMALIZE_NEWLINES_REGEX = /\r\n|\r|\n/g;
|
|
265
|
+
function normalizeNewlines(text, normalized = '\n') {
|
|
266
|
+
return text.replace(NORMALIZE_NEWLINES_REGEX, normalized);
|
|
268
267
|
}
|
|
269
268
|
|
|
270
269
|
|
|
@@ -336,7 +335,7 @@ function collapseExtraSpaces(text) {
|
|
|
336
335
|
* @param {string} text
|
|
337
336
|
* @returns {string}
|
|
338
337
|
*/
|
|
339
|
-
const CONTROL_CHARS_REGEX = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u0080-\u009F\u200E\u200F\u202A-\u202E]+/g;
|
|
338
|
+
export const CONTROL_CHARS_REGEX = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u0080-\u009F\u200E\u200F\u202A-\u202E]+/g;
|
|
340
339
|
function purgeControlCharacters(text) {
|
|
341
340
|
return text.replace(CONTROL_CHARS_REGEX, '');
|
|
342
341
|
}
|
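One thing worth checking when reusing the newly exported regexes: they are declared with the `g` flag, and in JavaScript a global regex keeps its `lastIndex` between `.test()` calls. The sketch below demonstrates that language behavior with the same pattern as `CONTROL_CHARS_REGEX`. `testStateless` is a hypothetical helper, not something the package provides, and the diff alone does not show whether the package guards against this elsewhere.

```javascript
// Same pattern and flags as CONTROL_CHARS_REGEX in the diff.
const CONTROL_CHARS_REGEX = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u0080-\u009F\u200E\u200F\u202A-\u202E]+/g;

const dirty = 'abc\u0007'; // ends with a BEL control character

// First call matches at index 3 and leaves lastIndex at 4.
const first = CONTROL_CHARS_REGEX.test(dirty);
// Second call starts searching at index 4, finds nothing, and resets lastIndex to 0.
const second = CONTROL_CHARS_REGEX.test(dirty);
console.log(first, second); // true false

// Hypothetical helper: resets lastIndex so each check starts from index 0.
function testStateless(regex, text) {
  regex.lastIndex = 0;
  const found = regex.test(text);
  regex.lastIndex = 0;
  return found;
}

console.log(
  testStateless(CONTROL_CHARS_REGEX, dirty),
  testStateless(CONTROL_CHARS_REGEX, dirty)
); // true true
```

Resetting `lastIndex` is one option; building a separate non-global regex from the same source for detection would be another.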