check-unicode 0.4.0__tar.gz
- check_unicode-0.4.0/.gitignore +12 -0
- check_unicode-0.4.0/LICENSE +21 -0
- check_unicode-0.4.0/PKG-INFO +149 -0
- check_unicode-0.4.0/README.md +119 -0
- check_unicode-0.4.0/docs/check-unicode.1 +357 -0
- check_unicode-0.4.0/pyproject.toml +77 -0
- check_unicode-0.4.0/src/check_unicode/__init__.py +3 -0
- check_unicode-0.4.0/src/check_unicode/categories.py +62 -0
- check_unicode-0.4.0/src/check_unicode/checker.py +213 -0
- check_unicode-0.4.0/src/check_unicode/confusables.py +59 -0
- check_unicode-0.4.0/src/check_unicode/fixer.py +67 -0
- check_unicode-0.4.0/src/check_unicode/main.py +622 -0
- check_unicode-0.4.0/src/check_unicode/output.py +141 -0
- check_unicode-0.4.0/src/check_unicode/scripts.py +147 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 mit-d
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,149 @@
+Metadata-Version: 2.4
+Name: check-unicode
+Version: 0.4.0
+Summary: Pre-commit hook to detect and fix non-ASCII Unicode characters
+Project-URL: Changelog, https://github.com/mit-d/check-unicode/blob/main/CHANGELOG.md
+Project-URL: Issues, https://github.com/mit-d/check-unicode/issues
+Project-URL: Repository, https://github.com/mit-d/check-unicode
+Author-email: mit-d <derekmttn@gmail.com>
+License-Expression: MIT
+License-File: LICENSE
+Keywords: linting,pre-commit,security,trojan-source,unicode
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3 :: Only
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Software Development :: Quality Assurance
+Requires-Python: >=3.11
+Provides-Extra: dev
+Requires-Dist: bump-my-version; extra == 'dev'
+Requires-Dist: mypy; extra == 'dev'
+Requires-Dist: pytest; extra == 'dev'
+Requires-Dist: pytest-cov; extra == 'dev'
+Requires-Dist: ruff; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# check-unicode
+
+[](https://github.com/mit-d/check-unicode/actions/workflows/test.yml)
+[](LICENSE)
+[](https://www.python.org/)
+
+A pre-commit hook to detect and fix non-ASCII Unicode characters in text files.
+
+Catches smart quotes, em dashes, fancy spaces, dangerous invisible characters
+(Trojan Source bidi attacks, zero-width chars), and other copy-paste artifacts.
+
+## Installation
+
+### As a pre-commit hook
+
+```yaml
+repos:
+  - repo: https://github.com/mit-d/check-unicode
+    rev: v0.4.0
+    hooks:
+      - id: check-unicode
+      # or for auto-fix:
+      - id: fix-unicode
+```
+
+### Standalone
+
+```bash
+pip install check-unicode
+check-unicode path/to/file.txt
+```
+
+## Usage
+
+```text
+check-unicode [OPTIONS] [FILES...]
+```
+
+| Flag | Description | Default |
+| ----------------------- | ----------------------------------------------------- | ------------- |
+| `--fix` | Replace known offenders with ASCII, exit 1 if changed | off |
+| `--allow-range RANGE` | Allow a Unicode range (e.g. `U+00A0-U+00FF`). Repeat. | none |
+| `--allow-codepoint CP` | Allow codepoints (e.g. `U+00B0`). Repeat/comma-sep. | none |
+| `--allow-category CAT` | Allow Unicode category (e.g. `Sc`). Repeatable. | none |
+| `--allow-printable` | Allow all printable chars (only flag invisibles) | off |
+| `--allow-script SCRIPT` | Allow Unicode script (e.g. `Latin`). Repeatable. | none |
+| `--check-confusables` | Detect mixed-script homoglyph/confusable characters | off |
+| `--severity LEVEL` | `error` (exit 1) or `warning` (print, exit 0) | `error` |
+| `--no-color` | Disable ANSI color | auto-detect |
+| `--config FILE` | Path to TOML config | auto-discover |
+| `-q` / `--quiet` | Summary only | off |
+| `-V` / `--version` | Print version | |
+
+## What it catches
+
+- **Smart quotes**: `\u201c` `\u201d` `\u2018` `\u2019` and variants
+- **Dashes**: em dash `\u2014`, en dash `\u2013`, minus sign `\u2212`
+- **Fancy spaces**: non-breaking space, em space, thin space, etc.
+- **Ellipsis**: `\u2026`
+- **Dangerous invisible characters** (always flagged):
+  - Bidi control (Trojan Source CVE-2021-42574): U+202A-202E, U+2066-2069
+  - Zero-width: U+200B-200F, U+FEFF (mid-file), U+2060-2064, U+180E
+  - Replacement character: U+FFFD
+- **Confusable homoglyphs** (with `--check-confusables`):
+  - Mixed-script identifiers where minority-script chars look like Latin
+  - Cyrillic/Greek/Armenian letters that visually resemble Latin letters
+  - e.g. Cyrillic `a` (U+0430) mixed with Latin `ccess_level`
+
+## Auto-fix
+
+`--fix` replaces known offenders with ASCII equivalents:
+
+| Unicode | Replacement |
+| ------------ | ----------- |
+| Smart quotes | `'` or `"` |
+| En/em dashes | `--` |
+| Minus sign | `-` |
+| Fancy spaces | ` ` |
+| Ellipsis | `...` |
+
+Dangerous invisible characters are **never auto-fixed** -- they require manual
+review.
+
+## Configuration
+
+Create `.check-unicode.toml` or add to `pyproject.toml`:
+
+```toml
+[tool.check-unicode]
+allow-codepoints = ["U+00B0", "U+2192"]
+allow-ranges = ["U+00A0-U+00FF"]
+allow-categories = ["Sc"]
+allow-printable = true
+allow-scripts = ["Latin", "Cyrillic"]
+check-confusables = true
+severity = "error"
+```
+
+## Output
+
+```text
+path/to/file.txt:42:17: U+201C LEFT DOUBLE QUOTATION MARK [Ps]
+He said \u201chello\u201d to the crowd
+^
+Found 5 non-ASCII characters in 2 files (3 fixable, 1 dangerous)
+```
+
+## Development
+
+```bash
+uv venv && uv pip install -e ".[dev]"
+.venv/bin/pytest -v --cov
+.venv/bin/ruff check src/ tests/
+.venv/bin/mypy src/
+```
+
+## License
+
+MIT
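The Auto-fix table in the package description above maps each class of offender to an ASCII replacement. A minimal sketch of that idea follows. It is not the package's code (`fixer.py` is listed but not shown in this diff); the names `REPLACEMENTS` and `fix_text` are hypothetical, and the table holds only a few representative codepoints.

```python
# Hypothetical replacement table modeled on the README's Auto-fix section.
# The real package may cover more codepoints than this sample does.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "--",   # en dash
    "\u2014": "--",   # em dash
    "\u2212": "-",    # minus sign
    "\u00a0": " ",    # no-break space
    "\u2003": " ",    # em space
    "\u2009": " ",    # thin space
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(REPLACEMENTS)


def fix_text(text: str) -> str:
    """Return text with the known offenders replaced by ASCII equivalents.

    Characters absent from the table (including bidi controls and
    zero-width characters) pass through unchanged, which mirrors the
    README's rule that dangerous characters are never auto-fixed.
    """
    return text.translate(_TABLE)
```

Using a translation table keeps the fix to a single pass over the text, and leaving dangerous characters out of the table means they can only be removed by hand.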
@@ -0,0 +1,119 @@
+# check-unicode
+
+[](https://github.com/mit-d/check-unicode/actions/workflows/test.yml)
+[](LICENSE)
+[](https://www.python.org/)
+
+A pre-commit hook to detect and fix non-ASCII Unicode characters in text files.
+
+Catches smart quotes, em dashes, fancy spaces, dangerous invisible characters
+(Trojan Source bidi attacks, zero-width chars), and other copy-paste artifacts.
+
+## Installation
+
+### As a pre-commit hook
+
+```yaml
+repos:
+  - repo: https://github.com/mit-d/check-unicode
+    rev: v0.4.0
+    hooks:
+      - id: check-unicode
+      # or for auto-fix:
+      - id: fix-unicode
+```
+
+### Standalone
+
+```bash
+pip install check-unicode
+check-unicode path/to/file.txt
+```
+
+## Usage
+
+```text
+check-unicode [OPTIONS] [FILES...]
+```
+
+| Flag | Description | Default |
+| ----------------------- | ----------------------------------------------------- | ------------- |
+| `--fix` | Replace known offenders with ASCII, exit 1 if changed | off |
+| `--allow-range RANGE` | Allow a Unicode range (e.g. `U+00A0-U+00FF`). Repeat. | none |
+| `--allow-codepoint CP` | Allow codepoints (e.g. `U+00B0`). Repeat/comma-sep. | none |
+| `--allow-category CAT` | Allow Unicode category (e.g. `Sc`). Repeatable. | none |
+| `--allow-printable` | Allow all printable chars (only flag invisibles) | off |
+| `--allow-script SCRIPT` | Allow Unicode script (e.g. `Latin`). Repeatable. | none |
+| `--check-confusables` | Detect mixed-script homoglyph/confusable characters | off |
+| `--severity LEVEL` | `error` (exit 1) or `warning` (print, exit 0) | `error` |
+| `--no-color` | Disable ANSI color | auto-detect |
+| `--config FILE` | Path to TOML config | auto-discover |
+| `-q` / `--quiet` | Summary only | off |
+| `-V` / `--version` | Print version | |
+
+## What it catches
+
+- **Smart quotes**: `\u201c` `\u201d` `\u2018` `\u2019` and variants
+- **Dashes**: em dash `\u2014`, en dash `\u2013`, minus sign `\u2212`
+- **Fancy spaces**: non-breaking space, em space, thin space, etc.
+- **Ellipsis**: `\u2026`
+- **Dangerous invisible characters** (always flagged):
+  - Bidi control (Trojan Source CVE-2021-42574): U+202A-202E, U+2066-2069
+  - Zero-width: U+200B-200F, U+FEFF (mid-file), U+2060-2064, U+180E
+  - Replacement character: U+FFFD
+- **Confusable homoglyphs** (with `--check-confusables`):
+  - Mixed-script identifiers where minority-script chars look like Latin
+  - Cyrillic/Greek/Armenian letters that visually resemble Latin letters
+  - e.g. Cyrillic `a` (U+0430) mixed with Latin `ccess_level`
+
+## Auto-fix
+
+`--fix` replaces known offenders with ASCII equivalents:
+
+| Unicode | Replacement |
+| ------------ | ----------- |
+| Smart quotes | `'` or `"` |
+| En/em dashes | `--` |
+| Minus sign | `-` |
+| Fancy spaces | ` ` |
+| Ellipsis | `...` |
+
+Dangerous invisible characters are **never auto-fixed** -- they require manual
+review.
+
+## Configuration
+
+Create `.check-unicode.toml` or add to `pyproject.toml`:
+
+```toml
+[tool.check-unicode]
+allow-codepoints = ["U+00B0", "U+2192"]
+allow-ranges = ["U+00A0-U+00FF"]
+allow-categories = ["Sc"]
+allow-printable = true
+allow-scripts = ["Latin", "Cyrillic"]
+check-confusables = true
+severity = "error"
+```
+
+## Output
+
+```text
+path/to/file.txt:42:17: U+201C LEFT DOUBLE QUOTATION MARK [Ps]
+He said \u201chello\u201d to the crowd
+^
+Found 5 non-ASCII characters in 2 files (3 fixable, 1 dangerous)
+```
+
+## Development
+
+```bash
+uv venv && uv pip install -e ".[dev]"
+.venv/bin/pytest -v --cov
+.venv/bin/ruff check src/ tests/
+.venv/bin/mypy src/
+```
+
+## License
+
+MIT
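The README above describes two things a scanner has to do: locate every non-ASCII character, and mark the "dangerous invisible" ones separately. The sketch below shows one way this might look using only the standard library. It is an illustration under stated assumptions, not the code in `checker.py` (which this diff does not show): `Finding`, `DANGEROUS`, `scan`, and `describe` are hypothetical names, columns are assumed to be 1-based, and U+FEFF is treated as dangerous everywhere, whereas the README only flags it mid-file.

```python
import unicodedata
from typing import Iterator, NamedTuple


class Finding(NamedTuple):
    line: int        # 1-based line number
    col: int         # 1-based column (an assumption, see lead-in)
    char: str        # the offending character
    dangerous: bool  # True for bidi controls, zero-width chars, U+FFFD


# Codepoints taken from the README's "What it catches" list.
DANGEROUS = (
    set(range(0x202A, 0x202F))    # bidi embedding/override, U+202A-202E
    | set(range(0x2066, 0x206A))  # bidi isolates, U+2066-2069
    | set(range(0x200B, 0x2010))  # zero-width and marks, U+200B-200F
    | set(range(0x2060, 0x2065))  # word joiner etc., U+2060-2064
    | {0xFEFF, 0x180E, 0xFFFD}
)


def scan(text: str) -> Iterator[Finding]:
    """Yield one Finding per non-ASCII character in text."""
    # Split on "\n" only, so Unicode line separators are still reported.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                yield Finding(lineno, col, ch, ord(ch) in DANGEROUS)


def describe(path: str, finding: Finding) -> str:
    """Format a finding as path:line:col: U+XXXX NAME [category]."""
    cp = ord(finding.char)
    name = unicodedata.name(finding.char, "<unnamed>")
    category = unicodedata.category(finding.char)
    return f"{path}:{finding.line}:{finding.col}: U+{cp:04X} {name} [{category}]"
```

A caller would loop over `scan(text)` and print `describe(path, finding)` for each result; the name and category come straight from Python's `unicodedata` module.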
@@ -0,0 +1,357 @@
+.\" Man page for check-unicode
+.\" Generate with: man ./docs/check-unicode.1
+.TH CHECK\-UNICODE 1 "2026-02-20" "check-unicode 0.2.0" "User Commands"
+.
+.SH NAME
+check\-unicode \- detect and fix non\-ASCII Unicode characters in text files
+.
+.SH SYNOPSIS
+.B check\-unicode
+.RI [ OPTIONS ]
+.IR FILE ...
+.
+.SH DESCRIPTION
+.B check\-unicode
+scans text files for non\-ASCII Unicode characters and reports their location,
+codepoint, name, and category.
+It is designed to catch copy\-paste artifacts such as smart quotes, em dashes,
+fancy spaces, and dangerous invisible characters
+(Trojan Source bidi attacks, zero\-width chars).
+.PP
+In
+.B \-\-fix
+mode it replaces known offenders with ASCII equivalents.
+Dangerous invisible characters are never auto\-fixed and always require manual
+review.
+.PP
+.B check\-unicode
+is commonly used as a
+.BR pre\-commit (1)
+hook but also works as a standalone CLI tool.
+.
+.SH POSITIONAL ARGUMENTS
+.TP
+.I FILE ...
+One or more files to check.
+At least one file is required; the program exits with code\ 2 if none are
+provided.
+.
+.SH OPTIONS
+.SS Mode
+.TP
+.B \-\-fix
+Replace known offenders (smart quotes, en/em dashes, fancy spaces, ellipsis)
+with their ASCII equivalents using an atomic write (temp file + rename).
+Exits\ 1 if any file was modified.
+Dangerous invisible characters are never auto\-fixed.
+.TP
+.BR \-V ", " \-\-version
+Print the program version and exit.
+.
+.SS Allow\-list options
+These flags suppress findings for specific characters.
+They extend (never replace) any values set in the config file.
+Dangerous invisible characters are always flagged unless explicitly allowed by
+.BR \-\-allow\-codepoint .
+.TP
+.BI \-\-allow\-range " RANGE"
+Allow a Unicode range.
+The format is
+.IR U+XXXX\-U+YYYY .
+May be repeated for multiple ranges.
+.RS
+.PP
+Example:
+.B \-\-allow\-range U+00A0\-U+00FF
+.RE
+.TP
+.BI \-\-allow\-codepoint " CP"
+Allow specific Unicode codepoints.
+Accepts
+.I U+XXXX
+notation, comma\-separated and/or repeated.
+This is the
+.B only
+flag that can suppress dangerous invisible characters.
+.RS
+.PP
+Example:
+.B \-\-allow\-codepoint U+00B0,U+00A9
+.RE
+.TP
+.BI \-\-allow\-category " CAT"
+Allow a Unicode general category.
+May be repeated for multiple categories.
+Use
+.B \-\-list\-categories
+to see all valid values.
+.RS
+.PP
+Example:
+.B \-\-allow\-category Sc
+(Symbol, currency)
+.RE
+.TP
+.B \-\-allow\-printable
+Allow all printable non\-ASCII characters.
+Only invisible and control characters will be flagged.
+.TP
+.BI \-\-allow\-script " SCRIPT"
+Allow all characters from a Unicode script.
+May be repeated.
+Script names are case\-insensitive and normalized to title case.
+Use
+.B \-\-list\-scripts
+to see all valid names.
+.RS
+.PP
+Example:
+.B \-\-allow\-script Cyrillic \-\-allow\-script Greek
+.RE
+.TP
+.B \-\-list\-categories
+Print all 30 Unicode general categories with descriptions and examples,
+then exit.
+Useful for discovering valid values for
+.BR \-\-allow\-category .
+.TP
+.B \-\-list\-scripts
+Print all known Unicode script names, then exit.
+Useful for discovering valid values for
+.BR \-\-allow\-script .
+.
+.SS Detection options
+.TP
+.B \-\-check\-confusables
+Detect mixed\-script homoglyph/confusable characters, such as a Cyrillic
+.B a
+(U+0430) mixed into a Latin identifier.
+This check is
+.B not
+suppressed by
+.BR \-\-allow\-script .
+.
+.SS Output options
+.TP
+.BI \-\-severity " LEVEL"
+Set exit\-code behavior.
+.I LEVEL
+must be
+.B error
+(exit\ 1 on findings) or
+.B warning
+(print findings but exit\ 0).
+Default:
+.BR error .
+.TP
+.B \-\-no\-color
+Disable ANSI color output.
+Color is also disabled when the
+.B NO_COLOR
+environment variable is set or stdout is not a TTY.
+.TP
+.BR \-q ", " \-\-quiet
+Print the summary line only; suppress per\-finding details.
+.
+.SS Configuration
+.TP
+.BI \-\-config " FILE"
+Path to a TOML config file.
+If omitted, the program auto\-discovers
+.I .check\-unicode.toml
+in the current directory, or
+.I [tool.check\-unicode]
+in
+.IR pyproject.toml .
+.TP
+.BI \-\-exclude\-pattern " PATTERN"
+Exclude files matching a glob pattern.
+May be repeated.
+Extends any
+.B exclude\-patterns
+set in the config file.
+Patterns are matched against both the full path and the basename.
+.RS
+.PP
+Example:
+.B \-\-exclude\-pattern '*.min.js' \-\-exclude\-pattern 'vendor/*'
+.RE
+.
+.SH CONFIGURATION FILE
+Settings can be stored in
+.I .check\-unicode.toml
+(standalone) or under the
+.B [tool.check\-unicode]
+table in
+.IR pyproject.toml .
+CLI flags always extend config\-file values; they never replace them.
+.PP
+.nf
+.RS
+[tool.check\-unicode]
+allow\-codepoints = ["U+00B0", "U+2192"]
+allow\-ranges = ["U+00A0\-U+00FF"]
+allow\-categories = ["Sc"]
+allow\-printable = true
+allow\-scripts = ["Latin", "Cyrillic"]
+check\-confusables = true
+severity = "error"
+exclude\-patterns = ["*.min.js", "vendor/*"]
+.RE
+.fi
+.
+.SH EXIT CODES
+.TP
+.B 0
+No findings were detected, or
+.B \-\-severity=warning
+was used.
+.TP
+.B 1
+Non\-ASCII findings were detected, or files were modified in
+.B \-\-fix
+mode.
+.TP
+.B 2
+Usage error (invalid arguments, no files specified, etc.).
+.
+.SH WHAT IT CATCHES
+.SS Copy\-paste artifacts (fixable with \-\-fix)
+.TP
+.B Smart quotes
+\(lq\(rq \(oq\(cq and variants \(-> replaced with ASCII quotes
+.TP
+.B Dashes
+Em dash (U+2014), en dash (U+2013), minus sign (U+2212) \(-> replaced with
+.B \-\-
+or
+.BR \- .
+.TP
+.B Fancy spaces
+Non\-breaking space, em space, thin space, and 14 other Unicode space characters
+\(-> replaced with a regular space.
+.TP
+.B Ellipsis
+Horizontal ellipsis (U+2026) \(-> replaced with
+.BR ... .
+.
+.SS Dangerous invisible characters (never auto\-fixed)
+.TP
+.B Bidi controls (Trojan Source CVE\-2021\-42574)
+U+202A\-202E (embedding/override), U+2066\-2069 (isolate).
+These can make source code appear to do something different from what it
+actually does.
+.TP
+.B Zero\-width characters
+U+200B\-200F, U+FEFF (mid\-file BOM), U+2060\-2064, U+180E.
+Invisible characters that can break identifiers or hide malicious code.
+.TP
+.B Replacement character
+U+FFFD, usually indicates an encoding error.
+.
+.SS Confusable homoglyphs (with \-\-check\-confusables)
+Mixed\-script identifiers where minority\-script characters visually resemble
+Latin letters (e.g.\& Cyrillic
+.I a
+U+0430 in a Latin word).
+.
+.SH OUTPUT FORMAT
+For each finding, the program prints the file, line, column, codepoint,
+Unicode name, and general category:
+.PP
+.nf
+.RS
+path/to/file.txt:42:17: U+201C LEFT DOUBLE QUOTATION MARK [Ps]
+He said \(lqhello\(rq to the crowd
+^
+.RE
+.fi
+.PP
+After all findings, a summary line is printed:
+.PP
+.nf
+.RS
+Found 5 non\-ASCII characters in 2 files (3 fixable, 1 dangerous)
+.RE
+.fi
+.
+.SH ENVIRONMENT
+.TP
+.B NO_COLOR
+If set (to any value), ANSI color output is disabled.
+See
+.IR https://no\-color.org/ .
+.
+.SH EXAMPLES
+Check all Python files in a project:
+.PP
+.RS
+.B check\-unicode src/**/*.py
+.RE
+.PP
+Auto\-fix smart quotes and dashes:
+.PP
+.RS
+.B check\-unicode \-\-fix *.txt
+.RE
+.PP
+Allow printable characters, flag only invisibles:
+.PP
+.RS
+.B check\-unicode \-\-allow\-printable src/
+.RE
+.PP
+Detect confusables while allowing Cyrillic script:
+.PP
+.RS
+.B check\-unicode \-\-check\-confusables \-\-allow\-script Cyrillic src/
+.RE
+.PP
+Warn without failing CI, disable color:
+.PP
+.RS
+.B check\-unicode \-\-severity warning \-\-no\-color src/
+.RE
+.PP
+List all valid Unicode script names:
+.PP
+.RS
+.B check\-unicode \-\-list\-scripts
+.RE
+.PP
+List all valid Unicode general categories:
+.PP
+.RS
+.B check\-unicode \-\-list\-categories
+.RE
+.PP
+Use with pre\-commit:
+.PP
+.nf
+.RS
+repos:
+  \- repo: https://github.com/mit\-d/check\-unicode
+    rev: v0.2.0
+    hooks:
+      \- id: check\-unicode
+      # or for auto\-fix:
+      \- id: fix\-unicode
+.RE
+.fi
+.
+.SH SEE ALSO
+.BR pre\-commit (1),
+.BR python3 (1),
+.BR unicode (7)
+.PP
+Project repository:
+.I https://github.com/mit\-d/check\-unicode
+.
+.SH AUTHORS
+mit\-d <derekmttn@gmail.com>
+.
+.SH LICENSE
+MIT License.
+See the
+.I LICENSE
+file in the source distribution.
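The man page above says `--fix` rewrites files "using an atomic write (temp file + rename)". A minimal sketch of that general technique is given here for orientation. It is an assumption about the approach, not the package's `fixer.py`, and `atomic_write` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, text: str) -> None:
    """Replace the contents of path with text without exposing a partial file.

    The text is first written to a temporary file in the same directory,
    then moved over the target with os.replace, which overwrites the
    destination in a single step when both paths are on one filesystem.
    """
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        # newline="" writes the text as given, with no newline translation.
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.write(text)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if the write or the rename failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Creating the temporary file next to the target (rather than in the system temp directory) keeps both on the same filesystem, which the rename step depends on. This sketch does not preserve the original file's permissions; a fuller version would copy them before the rename.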