mojihen-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- mojihen-0.1.0/LICENSE +21 -0
- mojihen-0.1.0/PKG-INFO +355 -0
- mojihen-0.1.0/README.md +330 -0
- mojihen-0.1.0/pyproject.toml +48 -0
- mojihen-0.1.0/setup.cfg +4 -0
- mojihen-0.1.0/src/mojihen/__init__.py +7 -0
- mojihen-0.1.0/src/mojihen/cli.py +389 -0
- mojihen-0.1.0/src/mojihen/config.py +164 -0
- mojihen-0.1.0/src/mojihen/corpus.py +182 -0
- mojihen-0.1.0/src/mojihen/data/seed.json +113 -0
- mojihen-0.1.0/src/mojihen/decode.py +234 -0
- mojihen-0.1.0/src/mojihen/detect.py +318 -0
- mojihen-0.1.0/src/mojihen/extract.py +151 -0
- mojihen-0.1.0/src/mojihen/report.py +209 -0
- mojihen-0.1.0/src/mojihen.egg-info/PKG-INFO +355 -0
- mojihen-0.1.0/src/mojihen.egg-info/SOURCES.txt +24 -0
- mojihen-0.1.0/src/mojihen.egg-info/dependency_links.txt +1 -0
- mojihen-0.1.0/src/mojihen.egg-info/entry_points.txt +2 -0
- mojihen-0.1.0/src/mojihen.egg-info/requires.txt +6 -0
- mojihen-0.1.0/src/mojihen.egg-info/top_level.txt +1 -0
- mojihen-0.1.0/tests/test_cli_exit.py +159 -0
- mojihen-0.1.0/tests/test_corpus_regression.py +120 -0
- mojihen-0.1.0/tests/test_decode.py +136 -0
- mojihen-0.1.0/tests/test_extract.py +178 -0
- mojihen-0.1.0/tests/test_hook_stdin.py +152 -0
- mojihen-0.1.0/tests/test_precision.py +94 -0
mojihen-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 hryoma1217
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
mojihen-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,355 @@
+Metadata-Version: 2.4
+Name: mojihen
+Version: 0.1.0
+Summary: LLM-generated CJK corruption linter: catches valid-but-wrong kanji/hanzi that grep and tests miss
+License: MIT
+Keywords: cjk,japanese,linter,llm,unicode,corruption,pre-commit,ci
+Classifier: Development Status :: 3 - Alpha
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Software Development :: Quality Assurance
+Classifier: Topic :: Text Processing :: Linguistic
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: tomli>=1.1.0; python_version < "3.11"
+Provides-Extra: dev
+Requires-Dist: pytest; extra == "dev"
+Dynamic: license-file
+
+# mojihen
+
+**LLM-generated CJK corruption linter.** Catches *valid-but-wrong* kanji, hanzi,
+and hangul that language models emit silently: the class of bug that grep, unit
+tests, and every existing Unicode safety tool passes as **false-green**.
+
+```
+demo/sample.py:20:1 MH001 HIGH '闾' -> likely: 閾
+'闾' is a known LLM corruption (likely intended: 閾) [rare_drift]
+demo/sample.py:23:1 MH001 HIGH '耒' -> likely: 耐
+'耒' is a known LLM corruption (likely intended: 耐) [decomposition]
+```
+
+---
+
+## The problem
+
+When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes;
+it substitutes a **real, valid character** that looks or sounds close to the
+intended one. The wrong glyph is itself a legitimate Unicode codepoint.
+
+### Six observed cases (LLM-generated Japanese)
+
+| Intended | LLM emitted | Class | Why it hid |
+|---|---|---|---|
+| 閾 (threshold) | 闾 (village gate, U+95FE) | rare drift | 閾 (U+95BE) is uncommon; LLM drifted to a nearby codepoint |
+| 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone near-absent in modern JA |
+| 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Radical visual confusion |
+| 事 (matter) | 亊 (rare variant) | rare variant | U+4E8B vs U+4E8A, adjacent, visually near-identical |
+| 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
+| 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |
+
+### Why existing tools miss it
+
+- **grep / ripgrep**: searches for the *intended* string; the wrong glyph simply
+  does not match. Silent.
+- **Unit tests**: assertions were written against the already-corrupted value.
+  They pass. This actually happened.
+- **Unicode safety linters** (`bidichk`, `anti-trojan-source`, `unicode-safety-check`):
+  target *adversarial* unicode (invisible chars, bidi overrides, homoglyphs). These
+  substitutions are visible, in-script, non-adversarial. Out of scope for those tools.
+- **Chinese Spell Check (CSC) research**: models that correct *human* typos;
+  not packaged as a dev linter / CI gate / agent hook.
+
+mojihen is **first-in-category** for this failure mode.
+
+---
+
+## Install
+
+```bash
+pip install mojihen
+```
+
+Python 3.9+ required. Zero runtime dependencies beyond stdlib.
+(`tomllib` is used on Python 3.11+; on older versions, config file parsing
+gracefully degrades to defaults if `tomli` is not installed.)
+
+---
+
+## CLI usage
+
+```bash
+# Scan a file or directory
+mojihen src/
+
+# Scan with explicit options
+mojihen src/ --format tty --fail-on high
+
+# Output machine-readable JSON
+mojihen src/ --format json > findings.json
+
+# Output SARIF (for GitHub code scanning)
+mojihen src/ --format sarif > mojihen.sarif
+
+# Scan all text (bypass type-aware extraction)
+mojihen src/ --all-text
+
+# Use a custom config
+mojihen src/ --config path/to/mojihen.toml
+```
+
+### Exit codes
+
+| Code | Meaning |
+|------|---------|
+| 0 | No findings at or above the fail threshold |
+| 1 | One or more findings at or above the fail threshold |
+| 2 | Usage error, or agent hook blocked a write |
+
+---
+
+## pre-commit
+
+Add to `.pre-commit-config.yaml`:
+
+```yaml
+repos:
+  - repo: https://github.com/hryoma1217/mojihen
+    rev: v0.1.0
+    hooks:
+      - id: mojihen
+```
+
+This uses the bundled `.pre-commit-hooks.yaml`, which runs
+`mojihen --fail-on high` on every staged file.
+
+---
+
+## GitHub Action
+
+```yaml
+# .github/workflows/mojihen.yml
+name: CJK corruption check
+on: [push, pull_request]
+
+jobs:
+  mojihen:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: hryoma1217/mojihen@v0.1.0
+        with:
+          paths: src/
+          fail-on: high
+          format: sarif
+          sarif-output: mojihen.sarif
+```
+
+Findings appear in the GitHub Security tab (code scanning).
+
+---
+
+## Agent hook (Claude Code / Codex)
+
+The killer use-case: scan **just-written text** as soon as the agent writes it,
+and bounce corrupt output back to the model immediately.
+
+### Claude Code (PostToolUse)
+
+In `.claude/settings.json`:
+
+```json
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Write|Edit",
+        "hooks": [
+          { "type": "command", "command": "mojihen hook --stdin" }
+        ]
+      }
+    ]
+  }
+}
+```
+
+### Codex
+
+In `.codex/config.toml`:
+
+```toml
+[hooks]
+post_write = "mojihen hook --stdin"
+```
+
+### What happens on corruption
+
+```
+mojihen: BLOCKED - LLM CJK corruption detected
+
+src/strings.py:3:18 MH001 HIGH '闾' -> likely: 閾
+src/strings.py:5:12 MH001 HIGH '耒' -> likely: 耐
+
+Verify the intended CJK text and rewrite before proceeding.
+```
+
+The hook exits 2; the agent sees the block reason and retries with corrected text.
+
+See `hooks/claude-code.md` and `hooks/codex.md` for full setup instructions.
+
+---
+
+## Configuration
+
+Create `mojihen.toml` in your project root (or `[tool.mojihen]` in `pyproject.toml`):
+
+```toml
+# mojihen.toml
+fail_on = "high"  # "high" | "medium"
+langs = ["ja", "zh", "ko"]
+extract = "auto"  # "auto" (type-aware) | "all-text"
+allow = []  # literal strings/chars to never flag
+corpus = []  # extra corpus JSON paths
+```
+
+### Inline suppression
+
+Suppress findings on a specific line:
+
+```python
+# Intentional use of the archaic character (corpus fixture)
+FIXTURE = "闾"  # mojihen: ignore
+
+# Suppress only a specific rule
+FIXTURE = "闾"  # mojihen: ignore[MH001]
+```
+
+---
+
+## How the corpus works
+
+`src/mojihen/data/seed.json` is a versioned, schema-validated list of known-wrong chars:
+
+```json
+{
+  "version": 1,
+  "entries": [
+    {
+      "wrong": "闾",
+      "intended": ["閾"],
+      "lang": "ja",
+      "class": "rare_drift",
+      "evidence": "observed in LLM Japanese output",
+      "confidence": "high"
+    }
+  ]
+}
+```
+
+### Confidence tiers
+
+| Tier | Meaning | CLI behaviour |
+|------|---------|---------------|
+| `high` | Rare char; near-zero false positives | Fails CI by default |
+| `medium` | Somewhat common; context-dependent | Warns; optionally fails |
+| `low` | Common char; production evidence but ambiguous | Info only |
+
+High-confidence entries are chars like `闾` (U+95FE) that are essentially absent
+from modern Japanese text (and rare in Chinese) and almost certainly signal LLM drift.
+Common kanji like `感` are kept at `low` to avoid flooding legitimate text with
+false positives.
+
+### Contributing a new entry
+
+1. Confirm the wrong char is a known-bad substitution with evidence (build log,
+   diff, screenshot).
+2. Confirm `wrong != intended` and both contain valid CJK.
+3. Choose `"confidence": "high"` only if the wrong char is rare in normal text.
+4. Add to `src/mojihen/data/seed.json` and run: `python -m unittest discover -s tests`
+5. The precision gate (`test_precision.py`) must still pass with zero MH001 high
+   findings on the clean fixture sentences.
+
+---
+
+## Detectors
+
+| ID | Name | Confidence |
+|----|------|------------|
+| MH001 | Corpus hit | high/medium/low (per entry) |
+| MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
+| MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
+| MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
+| MH005 | Decomposition garble (needs radical table) | deferred |
+
+MH004 and MH005 are deferred in v1; the known MH005 cases (耒耗, etc.) are
+already covered by individual MH001 corpus entries.
+
+---
+
+## Escape decoding
+
+mojihen decodes all escape forms **before** inspecting text, because LLMs
+frequently emit corrupted characters as `\uXXXX` escapes:
+
+| Form | Example | Decoded |
+|------|---------|---------|
+| `\uXXXX` | `\u95FE` | 闾 |
+| `\u{XXXXXX}` | `\u{95FE}` | 闾 |
+| Surrogate pair | `\uD83D\uDE00` | 😀 |
+| `\xXX` | `\x41` | A |
+| HTML decimal | `&#38398;` | 闾 |
+| HTML hex | `&#x95FE;` | 闾 |
+| Named entity | `&amp;` | & |
+
+---
+
+## Limitations and false-positive controls
+
+- **Common kanji**: Characters like `感` (feeling), `末` (end), `士` (person)
+  appear in thousands of legitimate Japanese words. They are only added to the
+  corpus at `low` confidence. Use `--fail-on high` (the default) to avoid noise.
+- **Context-free**: mojihen does not understand grammar or intent; it
+  pattern-matches against a corpus. False positives in unusual text can be suppressed
+  with `allow = [...]` in config or inline `# mojihen: ignore`.
+- **MH002/MH003** are medium-confidence and require `--fail-on medium` to fail CI.
+  They are informational by default.
+- The clean-corpus precision gate (`tests/test_precision.py`) must stay green;
+  this is the automated false-positive guard.
+
+---
+
+## About mojihen (translated from the Japanese section)
+
+`mojihen` (文字変) is a linter for LLM-generated Japanese, Chinese, and Korean text. It
+detects kanji that are valid Unicode code points but differ from the intended character.
+
+grep and unit tests cannot detect this kind of corruption, because the wrong character is
+also legitimate Unicode and the tests were written against the already-corrupted value.
+
+`mojihen` combines a corpus of known misuse patterns (`src/mojihen/data/seed.json`) with
+decoding of escape forms (`\uXXXX`, `&#NNNN;`, etc.) and runs as a CI check, a pre-commit
+hook, or an AI-agent hook (PostToolUse).
+
+---
+
+## Development
+
+```bash
+git clone https://github.com/hryoma1217/mojihen
+cd mojihen
+pip install -e ".[dev]"
+python -m unittest discover -s tests -v
+```
+
+---
+
+## License
+
+MIT. Copyright 2026 hryoma1217.
mojihen-0.1.0/README.md
ADDED
@@ -0,0 +1,330 @@
+# mojihen
+
+**LLM-generated CJK corruption linter.** Catches *valid-but-wrong* kanji, hanzi,
+and hangul that language models emit silently: the class of bug that grep, unit
+tests, and every existing Unicode safety tool passes as **false-green**.
+
+```
+demo/sample.py:20:1 MH001 HIGH '闾' -> likely: 閾
+'闾' is a known LLM corruption (likely intended: 閾) [rare_drift]
+demo/sample.py:23:1 MH001 HIGH '耒' -> likely: 耐
+'耒' is a known LLM corruption (likely intended: 耐) [decomposition]
+```
+
+---
+
+## The problem
+
+When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes;
+it substitutes a **real, valid character** that looks or sounds close to the
+intended one. The wrong glyph is itself a legitimate Unicode codepoint.
+
+### Six observed cases (LLM-generated Japanese)
+
+| Intended | LLM emitted | Class | Why it hid |
+|---|---|---|---|
+| 閾 (threshold) | 闾 (village gate, U+95FE) | rare drift | 閾 (U+95BE) is uncommon; LLM drifted to a nearby codepoint |
+| 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone near-absent in modern JA |
+| 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Radical visual confusion |
+| 事 (matter) | 亊 (rare variant) | rare variant | U+4E8B vs U+4E8A, adjacent, visually near-identical |
+| 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
+| 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |
+
+### Why existing tools miss it
+
+- **grep / ripgrep**: searches for the *intended* string; the wrong glyph simply
+  does not match. Silent.
+- **Unit tests**: assertions were written against the already-corrupted value.
+  They pass. This actually happened.
+- **Unicode safety linters** (`bidichk`, `anti-trojan-source`, `unicode-safety-check`):
+  target *adversarial* unicode (invisible chars, bidi overrides, homoglyphs). These
+  substitutions are visible, in-script, non-adversarial. Out of scope for those tools.
+- **Chinese Spell Check (CSC) research**: models that correct *human* typos;
+  not packaged as a dev linter / CI gate / agent hook.
+
+mojihen is **first-in-category** for this failure mode.
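
The silent miss described above can be reproduced with the standard library alone. The sketch below assumes a hypothetical UI string in which `閾値` (threshold value) was emitted as `闾値`; the variable names are illustrative and are not taken from mojihen.

```python
import unicodedata

# Hypothetical LLM output: the intended word is 閾値 ("threshold value"),
# but the first character was emitted as 闾 (U+95FE).
intended = "閾値"
emitted = "闾値を超えました"

# A grep-style search for the intended string finds nothing and raises no error.
found = intended in emitted

# The wrong glyph is a valid, assigned codepoint, so encoding checks pass too.
wrong = emitted[0]
codepoint = f"U+{ord(wrong):04X}"
name = unicodedata.name(wrong)
round_trips = emitted.encode("utf-8").decode("utf-8") == emitted

print(found, codepoint, name, round_trips)
```

Nothing in this check fails loudly: the search simply returns no match and the text round-trips through UTF-8 unchanged, which is why a corpus of known-wrong characters is needed to catch it.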
+
+---
+
+## Install
+
+```bash
+pip install mojihen
+```
+
+Python 3.9+ required. Zero runtime dependencies beyond stdlib.
+(`tomllib` is used on Python 3.11+; on older versions, config file parsing
+gracefully degrades to defaults if `tomli` is not installed.)
+
+---
+
+## CLI usage
+
+```bash
+# Scan a file or directory
+mojihen src/
+
+# Scan with explicit options
+mojihen src/ --format tty --fail-on high
+
+# Output machine-readable JSON
+mojihen src/ --format json > findings.json
+
+# Output SARIF (for GitHub code scanning)
+mojihen src/ --format sarif > mojihen.sarif
+
+# Scan all text (bypass type-aware extraction)
+mojihen src/ --all-text
+
+# Use a custom config
+mojihen src/ --config path/to/mojihen.toml
+```
+
+### Exit codes
+
+| Code | Meaning |
+|------|---------|
+| 0 | No findings at or above the fail threshold |
+| 1 | One or more findings at or above the fail threshold |
+| 2 | Usage error, or agent hook blocked a write |
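
One way to read the first two rows of the table is as a threshold comparison over finding confidences. The helper below is a hypothetical sketch, not mojihen's actual code; it assumes the ordering low < medium < high and does not model exit code 2.

```python
from typing import Iterable

# Severity order assumed from the confidence tiers: low < medium < high.
_RANK = {"low": 0, "medium": 1, "high": 2}


def exit_code(confidences: Iterable[str], fail_on: str = "high") -> int:
    """Return 1 if any finding is at or above the threshold, else 0."""
    threshold = _RANK[fail_on]
    return 1 if any(_RANK[c] >= threshold for c in confidences) else 0
```

Under this reading, a run that reports only `low` and `medium` findings still exits 0 with the default `--fail-on high`.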
+
+---
+
+## pre-commit
+
+Add to `.pre-commit-config.yaml`:
+
+```yaml
+repos:
+  - repo: https://github.com/hryoma1217/mojihen
+    rev: v0.1.0
+    hooks:
+      - id: mojihen
+```
+
+This uses the bundled `.pre-commit-hooks.yaml`, which runs
+`mojihen --fail-on high` on every staged file.
+
+---
+
+## GitHub Action
+
+```yaml
+# .github/workflows/mojihen.yml
+name: CJK corruption check
+on: [push, pull_request]
+
+jobs:
+  mojihen:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: hryoma1217/mojihen@v0.1.0
+        with:
+          paths: src/
+          fail-on: high
+          format: sarif
+          sarif-output: mojihen.sarif
+```
+
+Findings appear in the GitHub Security tab (code scanning).
+
+---
+
+## Agent hook (Claude Code / Codex)
+
+The killer use-case: scan **just-written text** as soon as the agent writes it,
+and bounce corrupt output back to the model immediately.
+
+### Claude Code (PostToolUse)
+
+In `.claude/settings.json`:
+
+```json
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Write|Edit",
+        "hooks": [
+          { "type": "command", "command": "mojihen hook --stdin" }
+        ]
+      }
+    ]
+  }
+}
+```
+
+### Codex
+
+In `.codex/config.toml`:
+
+```toml
+[hooks]
+post_write = "mojihen hook --stdin"
+```
+
+### What happens on corruption
+
+```
+mojihen: BLOCKED - LLM CJK corruption detected
+
+src/strings.py:3:18 MH001 HIGH '闾' -> likely: 閾
+src/strings.py:5:12 MH001 HIGH '耒' -> likely: 耐
+
+Verify the intended CJK text and rewrite before proceeding.
+```
+
+The hook exits 2; the agent sees the block reason and retries with corrected text.
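
The block-and-retry flow can be sketched as a pure function from a hook payload to an exit code and a message. This is an illustration only: the payload shape (`tool_input.file_path` and `tool_input.content`), the `run_hook` name, and the one-entry `KNOWN_WRONG` table are assumptions, not mojihen's implementation.

```python
import json
from typing import Tuple

# One-entry stand-in for the corpus (assumption; the real seed.json is larger).
KNOWN_WRONG = {"闾": "閾"}


def run_hook(payload: str) -> Tuple[int, str]:
    """Return (exit_code, message) for one hook invocation.

    Assumes the agent sends JSON shaped like
    {"tool_input": {"file_path": ..., "content": ...}}.
    """
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return 0, ""  # unreadable payload: do not block the agent
    tool_input = data.get("tool_input") if isinstance(data, dict) else None
    if not isinstance(tool_input, dict):
        return 0, ""
    path = tool_input.get("file_path", "<stdin>")
    text = str(tool_input.get("content", ""))

    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in KNOWN_WRONG:
                findings.append(
                    f"{path}:{lineno}:{col} MH001 HIGH '{ch}' -> likely: {KNOWN_WRONG[ch]}"
                )
    if not findings:
        return 0, ""
    header = "mojihen: BLOCKED - LLM CJK corruption detected"
    return 2, "\n".join([header, "", *findings])
```

A real entry point would read the payload from `sys.stdin`, write the message to stderr, and call `sys.exit` with the returned code.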
+
+See `hooks/claude-code.md` and `hooks/codex.md` for full setup instructions.
+
+---
+
+## Configuration
+
+Create `mojihen.toml` in your project root (or `[tool.mojihen]` in `pyproject.toml`):
+
+```toml
+# mojihen.toml
+fail_on = "high"  # "high" | "medium"
+langs = ["ja", "zh", "ko"]
+extract = "auto"  # "auto" (type-aware) | "all-text"
+allow = []  # literal strings/chars to never flag
+corpus = []  # extra corpus JSON paths
+```
+
+### Inline suppression
+
+Suppress findings on a specific line:
+
+```python
+# Intentional use of the archaic character (corpus fixture)
+FIXTURE = "闾"  # mojihen: ignore
+
+# Suppress only a specific rule
+FIXTURE = "闾"  # mojihen: ignore[MH001]
+```
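
The two comment forms above could be recognised with a single regular expression. The sketch below is hypothetical: the helper names are invented here, and support for a comma-separated rule list inside the brackets is an assumption that goes beyond the two documented forms.

```python
import re
from typing import Optional, Set

# Matches "# mojihen: ignore" with an optional "[MH001,MH002]" rule list.
_IGNORE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")


def suppressed_rules(line: str) -> Optional[Set[str]]:
    """Return None if nothing is suppressed, an empty set if every rule is
    suppressed, or the set of rule IDs named in brackets."""
    match = _IGNORE.search(line)
    if match is None:
        return None
    if match.group(1) is None:
        return set()
    return {rule.strip() for rule in match.group(1).split(",") if rule.strip()}


def is_suppressed(line: str, rule_id: str) -> bool:
    """True if rule_id should not be reported for this source line."""
    rules = suppressed_rules(line)
    return rules is not None and (not rules or rule_id in rules)
```

With this reading, a bare `ignore` silences every rule on the line, while `ignore[MH001]` leaves other rules such as MH002 active.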
+
+---
+
+## How the corpus works
+
+`src/mojihen/data/seed.json` is a versioned, schema-validated list of known-wrong chars:
+
+```json
+{
+  "version": 1,
+  "entries": [
+    {
+      "wrong": "闾",
+      "intended": ["閾"],
+      "lang": "ja",
+      "class": "rare_drift",
+      "evidence": "observed in LLM Japanese output",
+      "confidence": "high"
+    }
+  ]
+}
+```
+
+### Confidence tiers
+
+| Tier | Meaning | CLI behaviour |
+|------|---------|---------------|
+| `high` | Rare char; near-zero false positives | Fails CI by default |
+| `medium` | Somewhat common; context-dependent | Warns; optionally fails |
+| `low` | Common char; production evidence but ambiguous | Info only |
+
+High-confidence entries are chars like `闾` (U+95FE) that are essentially absent
+from modern Japanese text (and rare in Chinese) and almost certainly signal LLM drift.
+Common kanji like `感` are kept at `low` to avoid flooding legitimate text with
+false positives.
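
A corpus hit (MH001) amounts to a substring lookup against the `wrong` field, with the entry's tier carried through to the finding. The sketch below inlines the one-entry example from the schema above; `build_index` and `scan` are illustrative names, not functions from the package.

```python
import json
from typing import Dict, List, Tuple

SEED_JSON = """
{
  "version": 1,
  "entries": [
    {"wrong": "闾", "intended": ["閾"], "lang": "ja",
     "class": "rare_drift", "evidence": "observed in LLM Japanese output",
     "confidence": "high"}
  ]
}
"""


def build_index(seed_text: str) -> Dict[str, dict]:
    """Map each known-wrong string to its corpus entry."""
    return {entry["wrong"]: entry for entry in json.loads(seed_text)["entries"]}


def scan(text: str, index: Dict[str, dict]) -> List[Tuple[int, int, str, str]]:
    """Return (line, column, wrong, confidence) for every corpus hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for wrong, entry in index.items():
            start = line.find(wrong)
            while start != -1:
                findings.append((lineno, start + 1, wrong, entry["confidence"]))
                start = line.find(wrong, start + 1)
    return findings
```

Because the lookup has no notion of context, the confidence tier is the only lever for precision, which is why common characters stay at `low`.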
+
+### Contributing a new entry
+
+1. Confirm the wrong char is a known-bad substitution with evidence (build log,
+   diff, screenshot).
+2. Confirm `wrong != intended` and both contain valid CJK.
+3. Choose `"confidence": "high"` only if the wrong char is rare in normal text.
+4. Add to `src/mojihen/data/seed.json` and run: `python -m unittest discover -s tests`
+5. The precision gate (`test_precision.py`) must still pass with zero MH001 high
+   findings on the clean fixture sentences.
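
Steps 2 and 3 can be checked mechanically before an entry is added. The validator below is a hypothetical sketch: the real schema validation is not shown in this README, and the "valid CJK" test here is narrowed to the CJK Unified Ideographs and Hangul Syllables blocks for brevity.

```python
from typing import List

CONFIDENCES = {"high", "medium", "low"}


def has_cjk(text: str) -> bool:
    """True if any character is in CJK Unified Ideographs (U+4E00..U+9FFF)
    or Hangul Syllables (U+AC00..U+D7A3)."""
    return any("\u4e00" <= ch <= "\u9fff" or "\uac00" <= ch <= "\ud7a3" for ch in text)


def entry_problems(entry: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the entry passes."""
    problems = []
    wrong = entry.get("wrong", "")
    intended = entry.get("intended", [])
    if not wrong or not has_cjk(wrong):
        problems.append("wrong must contain CJK")
    if not intended or not all(has_cjk(s) for s in intended):
        problems.append("intended must contain CJK")
    if wrong in intended:
        problems.append("wrong must differ from intended")
    if entry.get("confidence") not in CONFIDENCES:
        problems.append("confidence must be high, medium, or low")
    if not entry.get("evidence"):
        problems.append("evidence is required")
    return problems
```

Whether a character is rare enough for `high` cannot be decided this way; that judgement still rests on the precision gate in step 5.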
+
+---
+
+## Detectors
+
+| ID | Name | Confidence |
+|----|------|------------|
+| MH001 | Corpus hit | high/medium/low (per entry) |
+| MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
+| MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
+| MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
+| MH005 | Decomposition garble (needs radical table) | deferred |
+
+MH004 and MH005 are deferred in v1; the known MH005 cases (耒耗, etc.) are
+already covered by individual MH001 corpus entries.
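
To make the MH002 row concrete, a mixed-script check can be approximated with three character classes. This is a rough sketch under stated assumptions: a "token" is any run of word characters, Han is limited to the CJK Unified Ideographs block, and `mixed_script_tokens` is an invented name rather than the detector in `detect.py`.

```python
import re
from typing import List

# Character classes for the sketch: Basic Latin letters, the Cyrillic block,
# and the CJK Unified Ideographs block only.
_HAN = re.compile(r"[\u4e00-\u9fff]")
_LATIN_OR_CYRILLIC = re.compile(r"[A-Za-z\u0400-\u04FF]")
# A "token" here is a run of word characters, a rough stand-in for an identifier.
_TOKEN = re.compile(r"\w+")


def mixed_script_tokens(text: str) -> List[str]:
    """Return tokens that mix Han with Latin or Cyrillic letters."""
    return [
        token
        for token in _TOKEN.findall(text)
        if _HAN.search(token) and _LATIN_OR_CYRILLIC.search(token)
    ]
```

Applied to running prose this would also flag ordinary compounds written without spaces, which fits the table's choice to scope MH002 to identifiers and keep it at medium confidence.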
+
+---
+
+## Escape decoding
+
+mojihen decodes all escape forms **before** inspecting text, because LLMs
+frequently emit corrupted characters as `\uXXXX` escapes:
+
+| Form | Example | Decoded |
+|------|---------|---------|
+| `\uXXXX` | `\u95FE` | 闾 |
+| `\u{XXXXXX}` | `\u{95FE}` | 闾 |
+| Surrogate pair | `\uD83D\uDE00` | 😀 |
+| `\xXX` | `\x41` | A |
+| HTML decimal | `&#38398;` | 闾 |
+| HTML hex | `&#x95FE;` | 闾 |
+| Named entity | `&amp;` | & |
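
The forms in the table can be normalised with a few regular-expression passes plus the standard library's `html.unescape`. The sketch below is an assumption about how such a decoder could be written, not the contents of `decode.py`; it does not handle escaped backslashes (`\\u95FE`) or nested encodings.

```python
import html
import re

_BRACED = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")
_SURROGATE_PAIR = re.compile(
    r"\\u([dD][89abAB][0-9A-Fa-f]{2})\\u([dD][c-fC-F][0-9A-Fa-f]{2})"
)
_U4 = re.compile(r"\\u([0-9A-Fa-f]{4})")
_X2 = re.compile(r"\\x([0-9A-Fa-f]{2})")


def _from_hex(match: re.Match) -> str:
    code = int(match.group(1), 16)
    # Leave lone surrogates and out-of-range values untouched.
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return match.group(0)
    return chr(code)


def _from_pair(match: re.Match) -> str:
    # Combine a UTF-16 high/low surrogate pair into one codepoint.
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


def decode_escapes(text: str) -> str:
    """Decode backslash and HTML escapes so the scanner sees real characters."""
    text = _BRACED.sub(_from_hex, text)
    text = _SURROGATE_PAIR.sub(_from_pair, text)  # must run before the 4-digit pass
    text = _U4.sub(_from_hex, text)
    text = _X2.sub(_from_hex, text)
    return html.unescape(text)
```

Decoding before the corpus lookup matters because a source file containing only the six ASCII characters `\u95FE` has no CJK in it at all until it is decoded.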
+
+---
+
+## Limitations and false-positive controls
+
+- **Common kanji**: Characters like `感` (feeling), `末` (end), `士` (person)
+  appear in thousands of legitimate Japanese words. They are only added to the
+  corpus at `low` confidence. Use `--fail-on high` (the default) to avoid noise.
+- **Context-free**: mojihen does not understand grammar or intent; it
+  pattern-matches against a corpus. False positives in unusual text can be suppressed
+  with `allow = [...]` in config or inline `# mojihen: ignore`.
+- **MH002/MH003** are medium-confidence and require `--fail-on medium` to fail CI.
+  They are informational by default.
+- The clean-corpus precision gate (`tests/test_precision.py`) must stay green;
+  this is the automated false-positive guard.
+
+---
+
+## About mojihen (translated from the Japanese section)
+
+`mojihen` (文字変) is a linter for LLM-generated Japanese, Chinese, and Korean text. It
+detects kanji that are valid Unicode code points but differ from the intended character.
+
+grep and unit tests cannot detect this kind of corruption, because the wrong character is
+also legitimate Unicode and the tests were written against the already-corrupted value.
+
+`mojihen` combines a corpus of known misuse patterns (`src/mojihen/data/seed.json`) with
+decoding of escape forms (`\uXXXX`, `&#NNNN;`, etc.) and runs as a CI check, a pre-commit
+hook, or an AI-agent hook (PostToolUse).
+
+---
+
+## Development
+
+```bash
+git clone https://github.com/hryoma1217/mojihen
+cd mojihen
+pip install -e ".[dev]"
+python -m unittest discover -s tests -v
+```
+
+---
+
+## License
+
+MIT. Copyright 2026 hryoma1217.