mojihen 0.1.0__tar.gz

mojihen-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 hryoma1217
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
mojihen-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,355 @@
1
+ Metadata-Version: 2.4
2
+ Name: mojihen
3
+ Version: 0.1.0
4
+ Summary: LLM-generated CJK corruption linter — catches valid-but-wrong kanji/hanzi that grep and tests miss
5
+ License: MIT
6
+ Keywords: cjk,japanese,linter,llm,unicode,corruption,pre-commit,ci
7
+ Classifier: Development Status :: 3 - Alpha
8
+ Classifier: Environment :: Console
9
+ Classifier: Intended Audience :: Developers
10
+ Classifier: License :: OSI Approved :: MIT License
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Programming Language :: Python :: 3.9
13
+ Classifier: Programming Language :: Python :: 3.10
14
+ Classifier: Programming Language :: Python :: 3.11
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Topic :: Software Development :: Quality Assurance
17
+ Classifier: Topic :: Text Processing :: Linguistic
18
+ Requires-Python: >=3.9
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: tomli>=1.1.0; python_version < "3.11"
22
+ Provides-Extra: dev
23
+ Requires-Dist: pytest; extra == "dev"
24
+ Dynamic: license-file
25
+
26
+ # mojihen
27
+
28
+ **LLM-generated CJK corruption linter.** Catches *valid-but-wrong* kanji, hanzi,
+ and hangul that language models emit silently: the class of bug that grep, unit
+ tests, and the Unicode safety tools listed below let through as a **false green**.
31
+
32
+ ```
33
+ demo/sample.py:20:1 MH001 HIGH '闾' -> likely: 閾
34
+ '闾' is a known LLM corruption (likely intended: 閾) [rare_drift]
35
+ demo/sample.py:23:1 MH001 HIGH '耒' -> likely: 耐
36
+ '耒' is a known LLM corruption (likely intended: 耐) [decomposition]
37
+ ```
38
+
39
+ ---
40
+
41
+ ## The problem
42
+
43
+ When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes —
44
+ it substitutes a **real, valid character** that looks or sounds close to the
45
+ intended one. The wrong glyph is itself a legitimate Unicode codepoint.
46
+
47
+ ### Six observed cases (LLM-generated Japanese)
48
+
49
+ | Intended | LLM emitted | Class | Why it hid |
50
+ |---|---|---|---|
51
+ | 閾 (threshold, U+95BE) | 闾 (village gate, U+95FE) | rare drift | 閾 is uncommon; LLM drifted to a nearby gate-radical character 0x40 codepoints away |
52
+ | 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone near-absent in modern JA |
53
+ | 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Radical visual confusion |
54
+ | 事 (matter) | 亊 (rare variant) | rare variant | U+4E8B vs U+4E8A: adjacent codepoints, near-identical glyphs |
55
+ | 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
56
+ | 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |
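
The codepoints quoted in the table can be checked with Python's built-in `ord`. This is only a verification sketch, not part of mojihen; the helper name `codepoint` is made up for illustration.

```python
# Character pairs taken from the table above (order within a pair does not matter here).
pairs = [("閾", "闾"), ("耐", "耒"), ("事", "亊")]

def codepoint(ch: str) -> str:
    """Format a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"

for a, b in pairs:
    distance = abs(ord(a) - ord(b))
    print(f"{a} {codepoint(a)}  {b} {codepoint(b)}  distance={distance}")
```

Every character involved is an ordinary assigned codepoint, which is why byte-level or encoding-level checks see nothing wrong.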
57
+
58
+ ### Why existing tools miss it
59
+
60
+ - **grep / ripgrep**: searches for the *intended* string; the wrong glyph simply
61
+ does not match. Silent.
62
+ - **Unit tests**: assertions were written against the already-corrupted value.
63
+ They pass. This actually happened.
64
+ - **Unicode safety linters** (`bidichk`, `anti-trojan-source`, `unicode-safety-check`):
65
+ target *adversarial* unicode (invisible chars, bidi overrides, homoglyphs). These
66
+ substitutions are visible, in-script, non-adversarial. Out of scope for those tools.
67
+ - **Chinese Spell Check (CSC) research**: models that correct *human* typos;
68
+ not packaged as a dev linter / CI gate / agent hook.
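
The first two bullets can be reproduced in a few lines. This is a hypothetical sketch: the label text, `get_label`, and the variable names are invented for illustration and do not come from the project.

```python
# A label an LLM was asked to write, with 閾 silently replaced by 闾.
intended_label = "閾値"   # "threshold value"
emitted_label = "闾値"    # what actually landed in the source file

source_line = f'THRESHOLD_LABEL = "{emitted_label}"'

# A grep-style search for the intended text finds nothing and raises no error.
grep_hit = intended_label in source_line

# A test generated alongside the code asserts the corrupted value, so it passes.
def get_label() -> str:
    return emitted_label

test_passes = get_label() == "闾値"

print(grep_hit, test_passes)  # False True
```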
69
+
70
+ To the author's knowledge, mojihen is the **first linter packaged** for this failure mode.
71
+
72
+ ---
73
+
74
+ ## Install
75
+
76
+ ```bash
77
+ pip install mojihen
78
+ ```
79
+
80
+ Python 3.9+ required. No runtime dependencies beyond the standard library on
+ Python 3.11+, where `tomllib` parses config files. On Python 3.9 and 3.10 the
+ package metadata declares `tomli` as a dependency; if it is missing, config file
+ parsing degrades gracefully to defaults.
83
+
84
+ ---
85
+
86
+ ## CLI usage
87
+
88
+ ```bash
89
+ # Scan a file or directory
90
+ mojihen src/
91
+
92
+ # Scan with explicit options
93
+ mojihen src/ --format tty --fail-on high
94
+
95
+ # Output machine-readable JSON
96
+ mojihen src/ --format json > findings.json
97
+
98
+ # Output SARIF (for GitHub code scanning)
99
+ mojihen src/ --format sarif > mojihen.sarif
100
+
101
+ # Scan all text (bypass type-aware extraction)
102
+ mojihen src/ --all-text
103
+
104
+ # Use a custom config
105
+ mojihen src/ --config path/to/mojihen.toml
106
+ ```
107
+
108
+ ### Exit codes
109
+
110
+ | Code | Meaning |
111
+ |------|---------|
112
+ | 0 | No findings at or above the fail threshold |
113
+ | 1 | One or more findings at or above the fail threshold |
114
+ | 2 | Usage error, or agent hook blocked a write |
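
A minimal sketch of how codes 0 and 1 could follow from the fail threshold. The severity ordering and the function name `exit_code` are assumptions made for illustration; mojihen's actual implementation is not shown in this diff, and code 2 is left out.

```python
# Assumed ordering: a higher rank means a more confident finding.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}

def exit_code(finding_severities, fail_on="high"):
    """Return 1 if any finding is at or above the threshold, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    failing = [s for s in finding_severities if SEVERITY_RANK[s] >= threshold]
    return 1 if failing else 0
```

With the default `fail_on="high"`, a run that reports only medium findings still exits 0.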
115
+
116
+ ---
117
+
118
+ ## pre-commit
119
+
120
+ Add to `.pre-commit-config.yaml`:
121
+
122
+ ```yaml
123
+ repos:
124
+ - repo: https://github.com/hryoma1217/mojihen
125
+ rev: v0.1.0
126
+ hooks:
127
+ - id: mojihen
128
+ ```
129
+
130
+ This uses the bundled `.pre-commit-hooks.yaml` which runs
131
+ `mojihen --fail-on high` on every staged file.
132
+
133
+ ---
134
+
135
+ ## GitHub Action
136
+
137
+ ```yaml
138
+ # .github/workflows/mojihen.yml
139
+ name: CJK corruption check
140
+ on: [push, pull_request]
141
+
142
+ jobs:
143
+ mojihen:
144
+ runs-on: ubuntu-latest
145
+ steps:
146
+ - uses: actions/checkout@v4
147
+ - uses: hryoma1217/mojihen@v0.1.0
148
+ with:
149
+ paths: src/
150
+ fail-on: high
151
+ format: sarif
152
+ sarif-output: mojihen.sarif
153
+ ```
154
+
155
+ Findings appear in the GitHub Security tab (code scanning).
156
+
157
+ ---
158
+
159
+ ## Agent hook (Claude Code / Codex)
160
+
161
+ The killer use-case: scan **just-written text** as soon as the agent's write or
+ edit tool call completes, and bounce corrupt output back to the model immediately.
163
+
164
+ ### Claude Code (PostToolUse)
165
+
166
+ In `.claude/settings.json`:
167
+
168
+ ```json
169
+ {
170
+ "hooks": {
171
+ "PostToolUse": [
172
+ {
173
+ "matcher": "Write|Edit",
174
+ "hooks": [
175
+ { "type": "command", "command": "mojihen hook --stdin" }
176
+ ]
177
+ }
178
+ ]
179
+ }
180
+ }
181
+ ```
182
+
183
+ ### Codex
184
+
185
+ In `.codex/config.toml`:
186
+
187
+ ```toml
188
+ [hooks]
189
+ post_write = "mojihen hook --stdin"
190
+ ```
191
+
192
+ ### What happens on corruption
193
+
194
+ ```
195
+ mojihen: BLOCKED - LLM CJK corruption detected
196
+
197
+ src/strings.py:3:18 MH001 HIGH '闾' -> likely: 閾
198
+ src/strings.py:5:12 MH001 HIGH '耒' -> likely: 耐
199
+
200
+ Verify the intended CJK text and rewrite before proceeding.
201
+ ```
202
+
203
+ The hook exits 2; the agent sees the block reason and retries with corrected text.
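
The hook flow can be sketched as below. It assumes the agent passes a JSON payload on stdin whose `tool_input` carries the written text under `content` (Write) or `new_string` (Edit). Those field names, the two-entry table, and the function names are assumptions for illustration, not mojihen's code.

```python
import json
import sys

# Two entries copied from the README examples; the real corpus lives in seed.json.
KNOWN_WRONG = {"闾": "閾", "耒": "耐"}

def written_text(payload: dict) -> str:
    """Pull the just-written text out of a tool-call payload (field names assumed)."""
    tool_input = payload.get("tool_input", {})
    return tool_input.get("content") or tool_input.get("new_string") or ""

def check(payload: dict):
    """Return (exit_code, message) for one hook invocation."""
    hits = [(ch, KNOWN_WRONG[ch]) for ch in written_text(payload) if ch in KNOWN_WRONG]
    if not hits:
        return 0, ""
    lines = ["BLOCKED - LLM CJK corruption detected"]
    lines += [f"'{wrong}' -> likely: {likely}" for wrong, likely in hits]
    return 2, "\n".join(lines)

def main(stream=None) -> int:
    """Read one payload, print any block reason to stderr, return the exit code."""
    code, message = check(json.load(stream or sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

A script entry point would call `sys.exit(main())`. The block reason goes to stderr so that the non-zero exit and the message travel together.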
204
+
205
+ See `hooks/claude-code.md` and `hooks/codex.md` for full setup instructions.
206
+
207
+ ---
208
+
209
+ ## Configuration
210
+
211
+ Create `mojihen.toml` in your project root (or `[tool.mojihen]` in `pyproject.toml`):
212
+
213
+ ```toml
214
+ # mojihen.toml
215
+ fail_on = "high" # "high" | "medium"
216
+ langs = ["ja", "zh", "ko"]
217
+ extract = "auto" # "auto" (type-aware) | "all-text"
218
+ allow = [] # literal strings/chars to never flag
219
+ corpus = [] # extra corpus JSON paths
220
+ ```
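
A hedged sketch of loading this config with the `tomllib`/`tomli` fallback described under Install. The `DEFAULTS` table mirrors the example above; `load_config` and `toml_parser` are illustrative names, not the package's API.

```python
import copy

try:
    import tomllib as toml_parser        # Python 3.11+
except ModuleNotFoundError:
    try:
        import tomli as toml_parser      # backport for Python 3.9 / 3.10
    except ModuleNotFoundError:
        toml_parser = None               # no parser available: use defaults

DEFAULTS = {
    "fail_on": "high",
    "langs": ["ja", "zh", "ko"],
    "extract": "auto",
    "allow": [],
    "corpus": [],
}

def load_config(toml_text: str) -> dict:
    """Merge recognised keys from TOML text over the defaults."""
    config = copy.deepcopy(DEFAULTS)
    if toml_parser is None:
        return config
    parsed = toml_parser.loads(toml_text)
    # Accept a bare mojihen.toml or a [tool.mojihen] table from pyproject.toml.
    section = parsed.get("tool", {}).get("mojihen", parsed)
    config.update({k: v for k, v in section.items() if k in DEFAULTS})
    return config
```

Unknown keys are dropped silently in this sketch; a real loader might prefer to warn about them.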
221
+
222
+ ### Inline suppression
223
+
224
+ Suppress findings on a specific line:
225
+
226
+ ```python
227
+ # Intentional use of the archaic character (corpus fixture)
228
+ FIXTURE = "闾" # mojihen: ignore
229
+
230
+ # Suppress only a specific rule
231
+ FIXTURE = "闾" # mojihen: ignore[MH001]
232
+ ```
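
One way such comments could be recognised is a single regular expression. This is a sketch under assumptions: the pattern, the support for comma-separated rule lists, and `is_suppressed` are illustrative and not taken from mojihen's source.

```python
import re

# Matches "# mojihen: ignore" and "# mojihen: ignore[MH001]".
# Comma-separated rule lists are an assumption of this sketch.
SUPPRESS_RE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")

def is_suppressed(line: str, rule_id: str) -> bool:
    """True if the line carries a suppression comment covering rule_id."""
    match = SUPPRESS_RE.search(line)
    if match is None:
        return False
    if match.group(1) is None:   # bare "ignore" silences every rule
        return True
    rules = {r.strip() for r in match.group(1).split(",")}
    return rule_id in rules
```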
233
+
234
+ ---
235
+
236
+ ## How the corpus works
237
+
238
+ `src/mojihen/data/seed.json` is a versioned, schema-validated list of known-wrong chars:
239
+
240
+ ```json
241
+ {
242
+ "version": 1,
243
+ "entries": [
244
+ {
245
+ "wrong": "闾",
246
+ "intended": ["閾"],
247
+ "lang": "ja",
248
+ "class": "rare_drift",
249
+ "evidence": "observed in LLM Japanese output",
250
+ "confidence": "high"
251
+ }
252
+ ]
253
+ }
254
+ ```
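
A corpus hit (MH001) amounts to a lookup of each character against the `wrong` field. The sketch below assumes single-character `wrong` values and uses illustrative names (`build_index`, `scan`); the real matcher may differ.

```python
import json

SEED_JSON = """
{
  "version": 1,
  "entries": [
    {"wrong": "闾", "intended": ["閾"], "lang": "ja",
     "class": "rare_drift", "evidence": "observed in LLM Japanese output",
     "confidence": "high"}
  ]
}
"""

def build_index(seed_text: str) -> dict:
    """Map each known-wrong character to its corpus entry."""
    return {entry["wrong"]: entry for entry in json.loads(seed_text)["entries"]}

def scan(text: str, index: dict) -> list:
    """Return (line, column, wrong, intended, confidence) tuples, 1-based."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            entry = index.get(ch)
            if entry is not None:
                findings.append(
                    (line_no, col_no, ch, entry["intended"], entry["confidence"])
                )
    return findings
```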
255
+
256
+ ### Confidence tiers
257
+
258
+ | Tier | Meaning | CLI behaviour |
259
+ |------|---------|---------------|
260
+ | `high` | Rare char; near-zero false positives | Fails CI by default |
261
+ | `medium` | Somewhat common; context-dependent | Warns; optionally fails |
262
+ | `low` | Common char; production evidence but ambiguous | Info only |
263
+
264
+ High-confidence entries are chars like `闾` (U+95FE) that are essentially absent
265
+ from modern Japanese/Chinese text and almost certainly signal LLM drift.
266
+ Common kanji like `感` are kept at `low` to avoid flooding legitimate text with
267
+ false positives.
268
+
269
+ ### Contributing a new entry
270
+
271
+ 1. Confirm the wrong char is a known-bad substitution with evidence (build log,
272
+ diff, screenshot).
273
+ 2. Confirm `wrong != intended` and both contain valid CJK.
274
+ 3. Choose `"confidence": "high"` only if the wrong char is rare in normal text.
275
+ 4. Add to `src/mojihen/data/seed.json` and run: `python -m unittest discover -s tests`
276
+ 5. The precision gate (`test_precision.py`) must still pass with zero MH001 high
277
+ findings on the clean fixture sentences.
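
Steps 2 and 3 can be checked mechanically before an entry is submitted. The sketch below is an assumption about what such a check could look like: the codepoint ranges are a deliberately small subset, and `contains_cjk` and `validate_entry` are illustrative names rather than the project's schema validator.

```python
# A small, non-exhaustive set of CJK ranges used only for this sketch.
CJK_RANGES = [
    (0x3400, 0x4DBF),   # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
    (0xAC00, 0xD7A3),   # Hangul Syllables
]

def contains_cjk(text: str) -> bool:
    """True if any character falls inside one of the ranges above."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks valid."""
    errors = []
    wrong = entry.get("wrong", "")
    intended = entry.get("intended", [])
    if not contains_cjk(wrong):
        errors.append("wrong must contain CJK")
    if not intended or not all(contains_cjk(item) for item in intended):
        errors.append("intended must be a non-empty list of CJK strings")
    if wrong in intended:
        errors.append("wrong must differ from intended")
    if entry.get("confidence") not in {"high", "medium", "low"}:
        errors.append("confidence must be high, medium, or low")
    return errors
```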
278
+
279
+ ---
280
+
281
+ ## Detectors
282
+
283
+ | ID | Name | Confidence |
284
+ |----|------|------------|
285
+ | MH001 | Corpus hit | high/medium/low (per entry) |
286
+ | MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
287
+ | MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
288
+ | MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
289
+ | MH005 | Decomposition garble (needs radical table) | deferred |
290
+
291
+ MH004 and MH005 are deferred in v0.1.0; the known MH005 cases (耒耗, etc.) are
+ already covered by individual MH001 corpus entries.
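
As a rough illustration of the MH002 idea, scripts can be inferred from Unicode character names in the standard library. This is a sketch, not the detector's actual logic; `scripts_in` and `mixed_script_tokens` are hypothetical names, and tokenising with `\w+` is an assumption.

```python
import re
import unicodedata

TOKEN_RE = re.compile(r"\w+")

def scripts_in(token: str) -> set:
    """Classify each character as Han, Latin, or Cyrillic via its Unicode name."""
    scripts = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            scripts.add("Han")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        elif name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
    return scripts

def mixed_script_tokens(text: str) -> list:
    """Return tokens that mix Han with Latin or Cyrillic characters."""
    flagged = []
    for token in TOKEN_RE.findall(text):
        scripts = scripts_in(token)
        if "Han" in scripts and len(scripts) > 1:
            flagged.append(token)
    return flagged
```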
293
+
294
+ ---
295
+
296
+ ## Escape decoding
297
+
298
+ mojihen decodes all escape forms **before** inspecting text, because LLMs
299
+ frequently emit corrupted characters as `\uXXXX` escapes:
300
+
301
+ | Form | Example | Decoded |
302
+ |------|---------|---------|
303
+ | `\uXXXX` | `\u95FE` | 闾 |
304
+ | `\u{XXXXXX}` | `\u{95FE}` | 闾 |
305
+ | Surrogate pair | `\uD83D\uDE00` | 😀 |
306
+ | `\xXX` | `\x41` | A |
307
+ | HTML decimal | `&#38398;` | 闾 |
308
+ | HTML hex | `&#x95FE;` | 闾 |
309
+ | Named entity | `&amp;` | & |
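
The forms in the table can all be decoded with the standard library. The sketch below shows one possible approach using `re` and `html.unescape`; the function name `decode_escapes` and the order of the steps are assumptions, not mojihen's implementation.

```python
import html
import re

ESCAPE_RE = re.compile(
    r"\\u\{([0-9A-Fa-f]{1,6})\}"   # \u{XXXXXX}
    r"|\\u([0-9A-Fa-f]{4})"        # \uXXXX
    r"|\\x([0-9A-Fa-f]{2})"        # \xXX
)

def decode_escapes(text: str) -> str:
    """Decode backslash escapes, join surrogate pairs, then decode HTML entities."""
    def replace(match):
        braced, four, two = match.groups()
        code = int(braced or four or two, 16)
        if code > 0x10FFFF:          # not a valid codepoint: leave the text as-is
            return match.group(0)
        return chr(code)

    decoded = ESCAPE_RE.sub(replace, text)
    # A UTF-16 round trip merges escaped surrogate pairs into one character;
    # "surrogatepass" keeps any unpaired surrogate instead of raising.
    decoded = decoded.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")
    return html.unescape(decoded)
```

Decoding before matching means a corpus lookup sees `闾` whether the source file contains the literal character or `\u95FE`.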
310
+
311
+ ---
312
+
313
+ ## Limitations and false-positive controls
314
+
315
+ - **Common kanji**: Characters like `感` (feeling), `末` (end), `士` (scholar, warrior)
316
+ appear in thousands of legitimate Japanese words. They are only added to the
317
+ corpus at `low` confidence. Use `--fail-on high` (the default) to avoid noise.
318
+ - **Context-free**: mojihen does not understand grammar or intent; it
+ pattern-matches against a corpus. False positives in unusual text can be
+ suppressed with `allow = [...]` in config or inline `# mojihen: ignore`.
321
+ - **MH002/MH003** are medium-confidence and require `--fail-on medium` to fail CI.
322
+ They are informational by default.
323
+ - The clean-corpus precision gate (`tests/test_precision.py`) must stay green;
324
+ this is the automated false-positive guard.
325
+
326
+ ---
327
+
328
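+ ## About mojihen (Japanese section)
+
+ `mojihen` (文字変) is a linter that detects kanji in LLM-generated Japanese, Chinese,
+ and Korean text that are valid Unicode codepoints but differ from the intended character.
+
+ grep and unit tests cannot catch this kind of character corruption, because the wrong
+ character is itself valid Unicode and the tests were written against the already-corrupted value.
+
+ `mojihen` combines a corpus of known misuse patterns (`src/mojihen/data/seed.json`) with
+ decoding of escape forms (`\uXXXX`, `&#NNNN;`, etc.), and runs in CI, in pre-commit, and as
+ an AI agent hook (PostToolUse).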
339
+
340
+ ---
341
+
342
+ ## Development
343
+
344
+ ```bash
345
+ git clone https://github.com/hryoma1217/mojihen
346
+ cd mojihen
347
+ pip install -e ".[dev]"
348
+ python -m unittest discover -s tests -v
349
+ ```
350
+
351
+ ---
352
+
353
+ ## License
354
+
355
+ MIT. Copyright 2026 hryoma1217.