patina-cli 3.11.0 → 4.0.0

Files changed (193)
  1. package/.patina.default.yaml +29 -29
  2. package/CHANGELOG.md +53 -0
  3. package/NOTICE +21 -0
  4. package/README.md +117 -224
  5. package/README_JA.md +134 -77
  6. package/README_KR.md +132 -74
  7. package/README_ZH.md +137 -80
  8. package/SKILL.md +11 -20
  9. package/artifacts/rebaseline-2025/README.md +147 -0
  10. package/artifacts/rebaseline-2025/human-controls.public.jsonl +250 -0
  11. package/artifacts/rebaseline-2025/intake.example.jsonl +2 -0
  12. package/artifacts/rebaseline-2025/intake.local.example.jsonl +25 -0
  13. package/artifacts/rebaseline-2025/prompts.template.jsonl +7 -0
  14. package/artifacts/rebaseline-2025/sources.ko-public.jsonl +39 -0
  15. package/assets/brand/patina-badge.svg +18 -0
  16. package/assets/brand/patina-mark.svg +8 -0
  17. package/assets/demo/README.md +79 -0
  18. package/core/scoring.md +12 -12
  19. package/core/standalone-prompt.md +3 -1
  20. package/core/stylometry.md +93 -22
  21. package/docs/API.md +1554 -0
  22. package/docs/AUTHENTICATION.md +50 -26
  23. package/docs/AUTHENTICATION_KR.md +54 -29
  24. package/docs/BRANDING.md +9 -8
  25. package/docs/CLI.md +55 -14
  26. package/docs/COOKBOOK.md +8 -21
  27. package/docs/DEMO.md +32 -5
  28. package/docs/EXIT-CODES.md +2 -3
  29. package/docs/FALSE-POSITIVES.md +63 -0
  30. package/docs/FAQ.md +9 -1
  31. package/docs/FAQ_KR.md +3 -1
  32. package/docs/FLAG-PARITY.md +33 -47
  33. package/docs/ISSUE-WAVES.md +57 -0
  34. package/docs/PATTERNS-EN.md +67 -3
  35. package/docs/PATTERNS-JA.md +68 -2
  36. package/docs/PATTERNS-KO.md +70 -7
  37. package/docs/PATTERNS-ZH.md +67 -3
  38. package/docs/PATTERNS.md +5 -5
  39. package/docs/RESEARCH-DOCS-PLATFORM.md +54 -0
  40. package/docs/ROADMAP.md +46 -66
  41. package/docs/TRANSLATIONESE-KO.md +51 -0
  42. package/docs/audits/2026-05-deep-research.md +3 -1
  43. package/docs/benchmarks/README.md +51 -0
  44. package/docs/benchmarks/detector-comparison.json +69 -9
  45. package/docs/benchmarks/detector-comparison.md +10 -5
  46. package/docs/benchmarks/katfish-ko-latest.json +657 -0
  47. package/docs/benchmarks/katfish-ko-latest.md +77 -0
  48. package/docs/benchmarks/latest.json +1183 -108
  49. package/docs/benchmarks/latest.md +84 -60
  50. package/docs/benchmarks/lexicon-freshness-en-2026-05-22.json +1121 -0
  51. package/docs/benchmarks/lexicon-freshness-en-2026-05-22.md +136 -0
  52. package/docs/benchmarks/rebaseline-latest.json +381 -0
  53. package/docs/benchmarks/rebaseline-latest.md +121 -0
  54. package/docs/benchmarks/register-stratified-latest.json +164 -0
  55. package/docs/benchmarks/register-stratified-latest.md +99 -0
  56. package/docs/benchmarks/register-stratified.md +43 -0
  57. package/docs/integrations/github-action.md +44 -11
  58. package/docs/integrations/playground.md +58 -0
  59. package/docs/integrations/pre-commit.md +5 -5
  60. package/docs/integrations/release.md +5 -3
  61. package/docs/integrations/static-sites.md +83 -0
  62. package/docs/research/2025-rebaseline-plan.md +71 -2
  63. package/docs/research/2026-rebaseline.md +102 -0
  64. package/docs/research/adversarial-mps.md +41 -0
  65. package/docs/research/ai-human-metrics.md +35 -23
  66. package/docs/research/human-eval-panel.md +42 -0
  67. package/docs/research/judge-agreement.md +24 -0
  68. package/docs/research/ko-2025-corpus-sources.md +135 -0
  69. package/docs/research/lexicon-freshness-audit.md +64 -0
  70. package/docs/research/zh-ja-lexicon-calibration.md +60 -0
  71. package/docs/social/patina-launch-copy.md +173 -100
  72. package/docs/social/patina-launch-execution.md +94 -0
  73. package/docs/social/patina-launch-korean-first.md +83 -0
  74. package/docs/social/signs-of-ai-writing.md +26 -0
  75. package/docs/social/signs-of-ai-writing_KR.md +26 -0
  76. package/lexicon/ai-en.md +21 -24
  77. package/lexicon/ai-ja.md +158 -0
  78. package/lexicon/ai-ko.md +9 -9
  79. package/lexicon/ai-zh.md +158 -0
  80. package/lexicon/provenance/ai-en.json +970 -0
  81. package/lexicon/provenance/ai-ja.json +542 -0
  82. package/lexicon/provenance/ai-ko.json +866 -0
  83. package/lexicon/provenance/ai-zh.json +542 -0
  84. package/package.json +49 -8
  85. package/patterns/en-communication.md +5 -0
  86. package/patterns/en-content.md +5 -0
  87. package/patterns/en-filler.md +5 -0
  88. package/patterns/en-language.md +29 -1
  89. package/patterns/en-structure.md +5 -0
  90. package/patterns/en-style.md +5 -0
  91. package/patterns/en-viral-hook.md +42 -2
  92. package/patterns/ja-communication.md +5 -0
  93. package/patterns/ja-content.md +5 -0
  94. package/patterns/ja-filler.md +5 -0
  95. package/patterns/ja-language.md +33 -1
  96. package/patterns/ja-structure.md +12 -0
  97. package/patterns/ja-style.md +5 -0
  98. package/patterns/ja-viral-hook.md +41 -2
  99. package/patterns/ko-communication.md +5 -0
  100. package/patterns/ko-content.md +5 -0
  101. package/patterns/ko-filler.md +5 -0
  102. package/patterns/ko-language.md +33 -1
  103. package/patterns/ko-structure.md +25 -6
  104. package/patterns/ko-style.md +5 -0
  105. package/patterns/ko-viral-hook.md +38 -2
  106. package/patterns/zh-communication.md +5 -0
  107. package/patterns/zh-content.md +5 -0
  108. package/patterns/zh-filler.md +5 -0
  109. package/patterns/zh-language.md +37 -1
  110. package/patterns/zh-structure.md +12 -0
  111. package/patterns/zh-style.md +5 -0
  112. package/patterns/zh-viral-hook.md +38 -2
  113. package/playground/README.md +55 -0
  114. package/playground/analytics.js +4 -0
  115. package/playground/analyzer.js +883 -0
  116. package/playground/app.js +157 -0
  117. package/playground/data/lexicons.js +343 -0
  118. package/playground/index.html +138 -0
  119. package/playground/styles.css +267 -0
  120. package/profiles/namuwiki.md +111 -0
  121. package/scripts/adversarial-mps-report.mjs +201 -0
  122. package/scripts/badge-json.mjs +79 -0
  123. package/scripts/benchmark-report.mjs +56 -9
  124. package/scripts/check-release-metadata.mjs +0 -2
  125. package/scripts/detector-comparison.mjs +7 -7
  126. package/scripts/generate-playground-data.mjs +77 -0
  127. package/scripts/katfish-calibration.mjs +464 -0
  128. package/scripts/lexicon-freshness.mjs +485 -0
  129. package/scripts/lint.mjs +1 -1
  130. package/scripts/precommit-score.mjs +4 -3
  131. package/scripts/prose-score.mjs +81 -5
  132. package/scripts/rebaseline-intake.mjs +242 -0
  133. package/scripts/rebaseline-score.mjs +268 -0
  134. package/scripts/rebaseline-summary.mjs +773 -0
  135. package/scripts/rebaseline-web-collect.mjs +410 -0
  136. package/scripts/update-benchmark-ranges.mjs +1 -0
  137. package/src/api.js +69 -105
  138. package/src/auth.js +50 -2
  139. package/src/backends/claude-cli.js +19 -4
  140. package/src/backends/codex-cli.js +19 -3
  141. package/src/backends/contract.js +230 -1
  142. package/src/backends/gemini-cli.js +18 -5
  143. package/src/backends/index.js +87 -12
  144. package/src/backends/kimi-cli.js +161 -0
  145. package/src/cli.js +577 -567
  146. package/src/commands/doctor.js +2 -2
  147. package/src/config.js +29 -0
  148. package/src/errors.js +53 -1
  149. package/src/features/discourse-tells.js +68 -0
  150. package/src/features/index.js +82 -8
  151. package/src/features/lexicon.js +40 -6
  152. package/src/features/markup-leakage.js +69 -0
  153. package/src/features/segment.js +41 -0
  154. package/src/features/signal-strength.js +81 -0
  155. package/src/features/stylometry.js +231 -1
  156. package/src/features/translationese.js +127 -0
  157. package/src/loader.js +76 -0
  158. package/src/logger.js +22 -23
  159. package/src/model-defaults.js +55 -0
  160. package/src/ouroboros.js +31 -0
  161. package/src/output.js +102 -90
  162. package/src/prompt-builder.js +103 -68
  163. package/src/providers.js +51 -4
  164. package/src/scoring.js +210 -2
  165. package/src/security.js +75 -0
  166. package/tests/fixtures/live-quality/en/public-docs-01.md +26 -0
  167. package/tests/fixtures/live-quality/ko/public-docs-01.md +26 -0
  168. package/tests/fixtures/suspect-zones/expected-ranges.json +207 -16
  169. package/tests/fixtures/suspect-zones/ja/ai/ja-ai-04-lexicon.md +11 -0
  170. package/tests/fixtures/suspect-zones/ja/natural/ja-nat-04-lexicon-cold.md +11 -0
  171. package/tests/fixtures/suspect-zones/ko/ai/ko-ai-02.md +4 -5
  172. package/tests/fixtures/suspect-zones/ko/ai/ko-ai-07-ko-diagnostic.md +11 -0
  173. package/tests/fixtures/suspect-zones/zh/ai/zh-ai-04-lexicon.md +11 -0
  174. package/tests/fixtures/suspect-zones/zh/natural/zh-nat-04-lexicon-cold.md +11 -0
  175. package/tests/quality/README.md +188 -11
  176. package/tests/quality/adversarial-mps/fixtures.jsonl +10 -0
  177. package/tests/quality/benchmark.mjs +39 -1
  178. package/tests/quality/dogfood.mjs +5 -3
  179. package/tests/quality/live-fixtures.jsonl +2 -0
  180. package/tests/quality/live-quality.mjs +596 -0
  181. package/tests/quality/ranking-metrics.mjs +136 -0
  182. package/tests/quality/rebaseline-manifest.example.jsonl +5 -0
  183. package/vercel.json +53 -0
  184. package/SKILL-MAX.md +0 -455
  185. package/docs/internal/HARNESS.md +0 -14
  186. package/docs/internal/README.md +0 -14
  187. package/docs/internal/WARP.md +0 -23
  188. package/patina-max/SKILL.md +0 -523
  189. package/patina-max/composite.py +0 -457
  190. package/src/cache.js +0 -106
  191. package/src/commands/init.js +0 -208
  192. package/src/manifest.js +0 -162
  193. package/src/max-mode.js +0 -207
@@ -2,7 +2,8 @@
2
2
 
3
3
  Status: protocol template, no external detector/vendor claims yet.
4
4
  Owner: maintainers.
5
- Related issues: #155, #163, #170, #172, #213.
5
+ Related issues: #155, #157, #160, #303, plus #158/#159 for evaluator follow-up. The #156 adversarial MPS gate is now in-tree.
6
+ Korean source inventory: `docs/research/ko-2025-corpus-sources.md`.
6
7
 
7
8
  Patina's checked-in benchmark is a deterministic regression corpus. It is useful
8
9
  for catching tokenizer, threshold, lexicon, and fixture drift. It is not enough
@@ -24,12 +25,29 @@ Minimum matrix:
24
25
  | generators | at least one GPT-family, Claude-family, Gemini-family, and open-weight model |
25
26
  | sample size | start at 25 paragraphs per language × class × register cell before publishing claims |
26
27
 
28
+ The 25-paragraph matrix is the first intake target for checking coverage holes.
29
+ It is not the public benchmark gate. The stricter claim gate remains the
30
+ `process/pattern-freshness.md` requirement: at least three generator families
31
+ across at least two languages with n≥100 per claim cell and binomial 95%
32
+ confidence intervals.
33
+
34
+
35
+ ## Execution order
36
+
37
+ 1. Start with Korean calibration (#303/#157): collect natural Korean controls for academic/종결-다, blog, product-doc, and community registers before changing KO thresholds again.
38
+ 2. Then run the model-era rebaseline (#155): score the fixed manifest across at least three generator families and at least two languages.
39
+ 3. Keep lexicon freshness separate from public catch-rate claims: English #160 remine now has per-entry provenance and ≥4× hot/cold lift evidence; rerun it during quarterly reviews, and do the same for KO/ZH/JA when paired corpora exist.
40
+ 4. Use the same manifest for cross-judge and blinded-panel follow-ups (#158/#159) instead of creating separate incompatible samples; keep the in-tree adversarial MPS gate current as a companion check.
41
+
27
42
  ## Data rules
28
43
 
29
44
  - Use redistributable prompts and generated text; do not check private user text into the repo.
30
45
  - Store full text only when redistribution is allowed. Otherwise store hashes, metadata, and metrics.
31
46
  - Keep generation metadata: model, provider, date, prompt id, decoding params, language, register, and any editing pass.
32
47
  - Separate detector-facing evaluation from rewrite-quality evaluation.
48
+ - For Korean 2025+ rows, start from `docs/research/ko-2025-corpus-sources.md`
49
+ and run the local intake helper before any public report:
50
+ `npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.jsonl --require-source-review`.
33
51
 
34
52
  ## Metrics
35
53
 
@@ -37,7 +55,7 @@ For deterministic suspect-zone detection:
37
55
 
38
56
  - accuracy, precision, recall, F1
39
57
  - Wilson 95% confidence interval per language and register
40
- - detector sub-signal breakdown: burstiness, MATTR, lexicon
58
+ - detector sub-signal breakdown: burstiness, MATTR, lexicon, koDiagnostics
41
59
  - expected metric ranges for checked-in fixtures
42
60
 
43
61
  For rewrite quality:
@@ -60,6 +78,11 @@ Recommended local/private layout before publishing sanitized summaries:
60
78
 
61
79
  ```text
62
80
  artifacts/rebaseline-2025/
81
+ ├── README.md # tracked scaffold and privacy rules
82
+ ├── intake.example.jsonl # tracked smoke fixture, not evidence
83
+ ├── intake.local.jsonl # gitignored local working input
84
+ ├── manifest.public.jsonl # gitignored sanitized output for local review
85
+ ├── human-controls.public.jsonl # tracked hash-only KO web candidate manifest
63
86
  ├── prompts.jsonl
64
87
  ├── generations.jsonl # only redistributable text
65
88
  ├── generations.private.jsonl # gitignored/private text if needed
@@ -71,6 +94,52 @@ artifacts/rebaseline-2025/
71
94
 
72
95
  Checked-in public summaries should live under `docs/benchmarks/` after review.
73
96
 
97
+ ## Manifest scaffold
98
+
99
+ Use the checked-in example manifest to validate the metadata contract before
100
+ collecting a larger corpus:
101
+
102
+ ```bash
103
+ npm run benchmark:rebaseline
104
+ npm run benchmark:rebaseline:report
105
+ node scripts/rebaseline-summary.mjs --input tests/quality/rebaseline-manifest.example.jsonl --json
106
+ npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.example.jsonl --dry-run
107
+ npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.example.jsonl --dry-run --require-source-review
108
+ npm run benchmark:rebaseline:web -- --target-per-register 50 --max-per-source 12 --collected-at 2026-05-22
109
+ npm run benchmark:rebaseline:score -- --input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl --output artifacts/rebaseline-2025/human-controls.public.jsonl --scored-at 2026-05-22
110
+ node scripts/rebaseline-summary.mjs --input artifacts/rebaseline-2025/human-controls.public.jsonl --json
111
+ ```
112
+
113
+ The manifest row schema is intentionally metadata-first:
114
+
115
+ - required: `sample_id`, `language`, `class`, `register`, `model_family`,
116
+ `provider`, `model`, `generated_at`, `prompt_id`, `decoding`, `postprocess`,
117
+ `redistribution`, `text_hash`
118
+ - optional: `text` only when redistribution permits it, `patina_score`,
119
+ `expected_hot`, `predicted_hot`, `source_url`, `source_license`,
120
+ `source_review`, `reviewer_notes`
121
+
122
+ `text_hash` uses the same `sha256:<hex>` style as runtime manifests. If a row
123
+ contains `text`, the validator checks that the digest matches. If a row is
124
+ `metadata-only`, `private`, or `no-redistribution`, full text must stay out of
125
+ the repository.
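A minimal sketch of the digest check described in the paragraph above. It assumes the digest is taken over the UTF-8 bytes of `text` with no normalization, which the diff does not confirm. The names `hashText`, `checkRow`, and `PRIVATE_MODES` are hypothetical and are not taken from `scripts/rebaseline-summary.mjs`.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: format a digest in the `sha256:<hex>` style.
// Assumption: the hash covers the raw UTF-8 bytes of the text.
function hashText(text) {
  return 'sha256:' + createHash('sha256').update(text, 'utf8').digest('hex');
}

// Redistribution values for which full text must stay out of the repository.
const PRIVATE_MODES = new Set(['metadata-only', 'private', 'no-redistribution']);

// Hypothetical row check: returns a list of human-readable problems.
function checkRow(row) {
  const errors = [];
  const hasText = typeof row.text === 'string';
  if (hasText && PRIVATE_MODES.has(row.redistribution)) {
    errors.push('text must not be stored when redistribution=' + row.redistribution);
  }
  if (hasText && hashText(row.text) !== row.text_hash) {
    errors.push('text_hash does not match text');
  }
  return errors;
}
```

A hash-only row (no `text` field) passes this check untouched, which mirrors how the tracked `human-controls.public.jsonl` manifest carries no raw text.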
126
+
127
+ `npm run benchmark:rebaseline:report` writes the sanitized summary to
128
+ `docs/benchmarks/rebaseline-latest.md` and `.json`. The default report now uses
129
+ the #155 claim-ready 2026 manifest; `tests/quality/rebaseline-manifest.example.jsonl`
130
+ remains the small `BLOCKED` fixture for proving the gate fails incomplete
131
+ evidence.
132
+
133
+ The tracked `artifacts/rebaseline-2025/human-controls.public.jsonl` file is a
134
+ 250-row Korean web candidate manifest for validating the provenance and hash-only
135
+ path, balanced at 50 rows per tracked register. It contains no raw text;
136
+ score/outcome fields are deterministic false-positive evidence only and must not
137
+ be used as a threshold or README performance claim by themselves.
138
+
139
+ Use the false-positive form for person-written samples that should feed the
140
+ human/natural side of future calibration:
141
+ <https://github.com/devswha/patina/issues/new?template=false_positive.yml>.
142
+
74
143
  ## Publication gate
75
144
 
76
145
  Do not publish competitive claims until the report includes:
@@ -0,0 +1,102 @@
1
+ # 2026 modern-model rebaseline
2
+
3
+ Status: claim-ready for the checked-in deterministic benchmark surface as of 2026-05-22.
4
+ Related issues: #155, #157, #160, #303.
5
+
6
+ This page records the first #155-compatible rebaseline after the HC3-era claim was retired. It does not replace the broader protocol in [`2025-rebaseline-plan.md`](2025-rebaseline-plan.md); it is the current public claim surface for Korean and English only.
7
+
8
+ ## Inputs
9
+
10
+ | input | rows | public text | notes |
11
+ |---|---:|---:|---|
12
+ | `artifacts/rebaseline-2025/private/modern-generations.private.jsonl` | 600 | 0 | Locally generated raw text from logged-in CLI surfaces; kept ignored under `private/`. |
13
+ | `artifacts/rebaseline-2025/human-controls.public.jsonl` | 100 selected from 250 | 0 | Korean public-web human controls, hash-only and already scored. |
14
+ | `artifacts/rebaseline-2025/private/hape-en.private.jsonl` | 100 selected from HAP-E human rows | 0 | English HAP-E human controls, transformed to hash-only public rows. |
15
+ | `artifacts/rebaseline-2025/rebaseline-2026.scored.public.jsonl` | 800 | 0 | Public scored manifest: metadata, hashes, deterministic scores, and outcomes only. |
16
+
17
+ Modern-model positive cells are balanced at n=100 for each language × family:
18
+
19
+ | language | model family | local surface/model |
20
+ |---|---|---|
21
+ | en | gpt-family | `codex-cli` / `gpt-5.5` |
22
+ | en | claude-family | `claude-cli` / `claude-sonnet-4-6` |
23
+ | en | gemini-family | `gemini-cli` / `gemini-2.5-pro` |
24
+ | ko | gpt-family | `codex-cli` / `gpt-5.5` |
25
+ | ko | claude-family | `claude-cli` / `claude-sonnet-4-6` |
26
+ | ko | gemini-family | `gemini-cli` / `gemini-2.5-pro` |
27
+
28
+ The model rows are ordinary assistant completions requested through CLI tools, not API-temperature-controlled lab completions. Treat the result as a deterministic Patina detector rebaseline for these local surfaces, not as an authorship study.
29
+
30
+ ## Claim gate
31
+
32
+ `npm run benchmark:rebaseline:claim-manifest -- --scored-at 2026-05-22` writes the public-safe manifest, then:
33
+
34
+ ```bash
35
+ node scripts/rebaseline-summary.mjs \
36
+ --input artifacts/rebaseline-2025/rebaseline-2026.scored.public.jsonl \
37
+ --write \
38
+ --basename rebaseline-latest \
39
+ --require-claim-ready
40
+ ```
41
+
42
+ Latest report: [`docs/benchmarks/rebaseline-latest.md`](../benchmarks/rebaseline-latest.md).
43
+
44
+ | gate | result |
45
+ |---|---:|
46
+ | generator families with n≥100 positive cells | 3 |
47
+ | languages with n≥100 positive cells | 2 |
48
+ | qualified positive language × family cells | 6 |
49
+ | natural/human languages with n≥100 | 2 |
50
+ | complete expected/predicted outcome rows | 800 |
51
+ | public raw text committed | 0 |
52
+
53
+ ## Headline metrics
54
+
55
+ All intervals below are Wilson 95% confidence intervals.
56
+
57
+ | metric | value |
58
+ |---|---:|
59
+ | Overall AI catch rate | 67.3% [63.5–71.0%] |
60
+ | Overall human-control false-positive rate | 16.0% [11.6–21.7%] |
61
+ | Precision | 92.7% |
62
+ | F1 | 0.780 |
63
+ | TP/FP/FN/TN | 404/32/196/168 |
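The headline numbers can be rechecked from the confusion counts alone. The sketch below uses the standard Wilson score interval with z = 1.96. It is an illustration, not the code in `scripts/rebaseline-summary.mjs`, and the function names `wilson` and `summarize` are hypothetical.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilson(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { rate: p, low: center - half, high: center + half };
}

// Derive the headline metrics from TP/FP/FN/TN counts.
function summarize({ tp, fp, fn, tn }) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return {
    catchRate: wilson(tp, tp + fn),        // recall over AI rows
    falsePositiveRate: wilson(fp, fp + tn), // over human-control rows
    precision,
    f1: (2 * precision * recall) / (precision + recall),
  };
}

const pct = (x) => (x * 100).toFixed(1);
const s = summarize({ tp: 404, fp: 32, fn: 196, tn: 168 });

console.log(`catch rate ${pct(s.catchRate.rate)}% [${pct(s.catchRate.low)} to ${pct(s.catchRate.high)}%]`);
// catch rate 67.3% [63.5 to 71.0%]
console.log(`FP rate ${pct(s.falsePositiveRate.rate)}% [${pct(s.falsePositiveRate.low)} to ${pct(s.falsePositiveRate.high)}%]`);
// FP rate 16.0% [11.6 to 21.7%]
console.log(`precision ${pct(s.precision)}%, F1 ${s.f1.toFixed(3)}`);
// precision 92.7%, F1 0.780
```

Calling `wilson(44, 100)` gives the 34.7% to 53.8% interval listed for the Korean GPT-family cell.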
64
+
65
+ ## Catch rate by language × model family
66
+
67
+ | language | model family | n | catch rate | 95% CI | caught/missed |
68
+ |---|---|---:|---:|---:|---:|
69
+ | en | claude-family | 100 | 74.0% | 64.6%–81.6% | 74/26 |
70
+ | en | gemini-family | 100 | 79.0% | 70.0%–85.8% | 79/21 |
71
+ | en | gpt-family | 100 | 77.0% | 67.8%–84.2% | 77/23 |
72
+ | ko | claude-family | 100 | 68.0% | 58.3%–76.3% | 68/32 |
73
+ | ko | gemini-family | 100 | 62.0% | 52.2%–70.9% | 62/38 |
74
+ | ko | gpt-family | 100 | 44.0% | 34.7%–53.8% | 44/56 |
75
+
76
+ ## False-positive rate by language
77
+
78
+ | language | n | FP rate | 95% CI | FP/TN |
79
+ |---|---:|---:|---:|---:|
80
+ | en | 100 | 14.0% | 8.5%–22.1% | 14/86 |
81
+ | ko | 100 | 18.0% | 11.7%–26.7% | 18/82 |
82
+
83
+ ## Interpretation
84
+
85
+ - The old README headline, “91% Korean / 76% English,” is no longer the current claim. It came from smaller HC3-era/paired fixtures and overstated current Korean catch behavior.
86
+ - English remains around the previous HC3 headline on these modern CLI samples (74–79% per family), but Korean is materially lower, especially the GPT-family cell at 44%.
87
+ - Human-control false positives remain inside the existing ≤25% acceptance envelope overall, but register-level review is still needed before any threshold tightening.
88
+ - This is an editing-hotspot detector benchmark. It must not be presented as a reliable author-attribution or detector-bypass guarantee.
89
+
90
+ ## Reproducibility notes
91
+
92
+ - Raw generated and HAP-E text stays in ignored `artifacts/rebaseline-2025/private/` files. The committed manifest stores `sha256` digests, metadata, Patina scores, and outcome labels only.
93
+ - The generation helper is `scripts/rebaseline-generate-modern.mjs`; the public-safe combiner is `scripts/rebaseline-build-claim-manifest.mjs`.
94
+ - HAP-E is used only for English human controls in this report. The previous #160 HAP-E lexicon-mining note still applies: HAP-E is useful, but it is not a substitute for Korean or 2026 model positives.
95
+ - `docs/benchmarks/rebaseline-latest.json` is the machine-readable companion to the Markdown report.
96
+
97
+ ## Remaining research work
98
+
99
+ 1. Add lightly/heavily edited-AI rows so rewrite behavior is tested separately from raw assistant completions.
100
+ 2. Add non-KO/EN languages once public controls and generated positives reach the same n≥100 cell gate.
101
+ 3. Repeat the generation on API surfaces with explicit decoding parameters if a stricter lab-style claim is needed.
102
+ 4. Review Korean GPT-family misses before changing thresholds; the current result argues for targeted KO diagnostics rather than global score inflation.
@@ -0,0 +1,41 @@
1
+ # Adversarial MPS audit
2
+
3
+ This report checks whether a rewrite can preserve explicit meaning anchors while still looking AI-like. It is a repo-owned adversarial fixture set, not a public model-performance claim.
4
+
5
+ Fixture source: `tests/quality/adversarial-mps/fixtures.jsonl`
6
+
7
+ ## Summary
8
+
9
+ - Fixtures: 10
10
+ - Passing adversarial cases: 10/10
11
+ - Minimum anchor-MPS proxy: 100.0
12
+ - Minimum deterministic AI score: 100.0
13
+ - Gate: MPS proxy ≥90 and deterministic AI score ≥60.
14
+
15
+ ## Results
16
+
17
+ | id | lang | register | MPS proxy | AI score | hot paragraphs | status |
18
+ |---|---|---|---:|---:|---:|---|
19
+ | adv-mps-ko-01 | ko | marketing | 100.0 | 100.0 | 1/1 | pass |
20
+ | adv-mps-ko-02 | ko | technical | 100.0 | 100.0 | 1/1 | pass |
21
+ | adv-mps-ko-03 | ko | academic | 100.0 | 100.0 | 1/1 | pass |
22
+ | adv-mps-ko-04 | ko | product-doc | 100.0 | 100.0 | 1/1 | pass |
23
+ | adv-mps-ko-05 | ko | policy | 100.0 | 100.0 | 1/1 | pass |
24
+ | adv-mps-en-01 | en | marketing | 100.0 | 100.0 | 1/1 | pass |
25
+ | adv-mps-en-02 | en | technical | 100.0 | 100.0 | 1/1 | pass |
26
+ | adv-mps-en-03 | en | academic | 100.0 | 100.0 | 1/1 | pass |
27
+ | adv-mps-en-04 | en | support | 100.0 | 100.0 | 1/1 | pass |
28
+ | adv-mps-en-05 | en | strategy | 100.0 | 100.0 | 1/1 | pass |
29
+
30
+ ## Interpretation
31
+
32
+ The audit confirms the known gap: an anchor-preservation floor can pass text that still retains AI-marker density. MPS should remain a meaning-safety floor, not a humanness score. A complementary anti-gaming check should penalize repeated AI-marker recurrence after rewrite, especially when MPS is high.
33
+
34
+ ## Proposed MPS-v2 companion check
35
+
36
+ Keep MPS unchanged for semantic safety, then add an independent recurrence gate:
37
+
38
+ 1. Score the original and rewritten text with deterministic `analyzeText`.
39
+ 2. If `MPS ≥ 90` and rewritten AI score remains `≥ 60`, mark the candidate as `style_not_improved`.
40
+ 3. In Ouroboros selection, prefer candidates that pass MPS and lower the AI score; do not let high MPS alone rescue a visibly AI-like rewrite.
41
+ 4. Report preserved anchors and recurring AI markers separately so users can decide whether to edit more or keep the register.
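The proposed companion check can be sketched as follows. The thresholds (MPS ≥ 90, AI score ≥ 60) come from the gate above. Everything else is an assumption for illustration: the function names, the field names `mps`, `aiScore`, and `aiScoreAfter`, and the labels `meaning_not_preserved` and `ok`. Only `style_not_improved` is named in the proposal.

```javascript
// Thresholds from the documented gate.
const MPS_FLOOR = 90;
const AI_SCORE_HOT = 60;

// Hypothetical classifier for one rewrite candidate.
function classifyRewrite({ mps, aiScoreAfter }) {
  if (mps < MPS_FLOOR) return 'meaning_not_preserved'; // hypothetical label
  if (aiScoreAfter >= AI_SCORE_HOT) return 'style_not_improved';
  return 'ok'; // hypothetical label
}

// Hypothetical selection step: keep candidates that pass MPS and lower the
// AI score relative to the original, then take the lowest remaining score.
// High MPS alone never rescues a candidate.
function pickCandidate(original, candidates) {
  const improved = candidates.filter(
    (c) => c.mps >= MPS_FLOOR && c.aiScoreAfter < original.aiScore
  );
  if (improved.length === 0) return null;
  return improved.reduce((best, c) => (c.aiScoreAfter < best.aiScoreAfter ? c : best));
}
```

Under these assumptions, every fixture in the results table (MPS proxy 100.0, AI score 100.0) would be labeled `style_not_improved`, which is the gap the audit describes.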
@@ -1,7 +1,7 @@
1
1
  # Quantifying the differences between AI and human writing: patina benchmark research notes
2
2
 
3
3
  Date written: 2026-05-20
4
- Status: research/design note, implementation proposal
4
+ Status: research/design note, partially implemented; reflects the deterministic benchmark and the opt-in live quality scaffold
5
5
 
6
6
  ## 1. Question
7
7
 
@@ -18,9 +18,9 @@ patina의 벤치마크에서 측정하려는 것은 단순히 “이 글의 출
18
18
  | Layer | File | Currently measured | Nature |
19
19
  |---|---|---|---|
20
20
  | deterministic benchmark | `tests/quality/benchmark.mjs` | accuracy, precision, recall, F1, TP/FP/FN/TN | No LLM calls. Regression test for the stylometry/lexicon hot decision |
21
- | future live quality regression | proposed `tests/quality/live-quality.mjs` | AI-likeness, MPS, fidelity, PASS/WARN/ERROR | Requires model calls. Checks rewrite quality and meaning preservation. Not yet implemented on main; tracked as a follow-up |
21
+ | opt-in live quality scaffold | `tests/quality/live-quality.mjs` | before/after AI-likeness, deterministic meaning proxy, safe_gain, PASS/WARN/FAIL | Skipped by default, with no model calls. When `OPENCODE_AVAILABLE=1`, checks rewrite quality and the meaning-preservation proxy |
22
22
  | scoring spec | `core/scoring.md` | AI-likeness, fidelity, combined score, MPS | Reference document for score definitions |
23
- | stylometry spec | `core/stylometry.md` | burstiness CV, MATTR, lexicon density | Definitions of the deterministic signals |
23
+ | stylometry spec | `core/stylometry.md` | burstiness CV, MATTR, lexicon density, koDiagnostics | Definitions of the deterministic signals |
24
24
 
25
25
  The current deterministic hot decision is the following OR rule.
26
26
 
@@ -28,18 +28,28 @@ patina의 벤치마크에서 측정하려는 것은 단순히 “이 글의 출
28
28
  paragraph is SUSPECT iff
29
29
  burstiness_band == "low"
30
30
  OR MATTR_band == "low"
31
- OR lexicon_density > threshold
31
+ OR (lexicon_density > threshold AND lexicon_min_hits is satisfied)
32
+ OR koDiagnostics.hot == true
32
33
  ```
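A minimal sketch of the updated OR rule as a predicate. The field names are hypothetical and do not come from `src/features/`. Reading "lexicon_min_hits is satisfied" as `hits >= minHits` is also an assumption.

```javascript
// Hypothetical paragraph-feature shape:
// {
//   burstinessBand: 'low' | 'mid' | 'high' | null,  // null when no band is assigned
//   mattrBand: 'low' | 'mid' | 'high' | null,
//   lexiconDensity: number, lexiconThreshold: number,
//   lexiconHits: number, lexiconMinHits: number,
//   koDiagnostics?: { hot: boolean }
// }
function isSuspect(p) {
  // Density alone is not enough: the minimum-hit guard must also hold.
  const lexiconHot =
    p.lexiconDensity > p.lexiconThreshold && p.lexiconHits >= p.lexiconMinHits;
  const koHot = Boolean(p.koDiagnostics && p.koDiagnostics.hot);
  return p.burstinessBand === 'low' || p.mattrBand === 'low' || lexiconHot || koHot;
}
```

With this shape, a short paragraph that matches a single lexicon entry stays cold as long as `lexiconMinHits` is above 1, even if its density exceeds the threshold.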
33
34
 
34
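- The repo's current local `tests/quality/results.json` snapshot shows an overall accuracy of 1.0 across 34 ko/en/zh/ja fixtures. However, this corpus consists of small, designed synthetic/curated fixtures, so it must not be over-trusted as evidence of generalization performance.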
35
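+ `burstiness_band` is assigned only when a paragraph has 3 or more sentences. With 2 sentences or fewer, the CV
36
+ is kept only as a diagnostic value, and that value alone does not produce a hot decision.
37
+ For the Korean/Chinese/Japanese lexicons, a single hit is not treated as hot and is kept only as an audit hint.
38
+ This is a conservative guard that reduces false positives where one common noun matches in a short professional/policy paragraph.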
39
+
40
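+ Korean (`lang=ko`) additionally records the `spacing`, `comma`, and `posDiversity` diagnostic values.
41
+ `posDiversity` is not a morphological analyzer but a particle/ending suffix-class proxy. Only when all three metrics
42
+ exceed their conservative thresholds does `koDiagnostics.hot` enter the OR rule.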
43
+
44
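+ The current public benchmark snapshot shows an overall accuracy of 1.0 across 39 ko/en/zh/ja fixtures. However, this corpus consists of small, designed synthetic/curated fixtures, so it must not be over-trusted as evidence of generalization performance.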
35
45
 
36
46
  ### 2.1 Short-term benchmark limits and the rebaseline plan (#155/#162)
37
47
 
38
48
  The currently checked-in deterministic report is useful as a regression test, but it is still too narrow to serve as a public performance claim.
39
49
 
40
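- - Sample size: 34 ko/en/zh/ja suspect-zone fixtures. Not large enough to produce confidence intervals per language, genre, or source.
50
+ - Sample size: 39 ko/en/zh/ja suspect-zone fixtures. Not large enough to produce confidence intervals per language, genre, or source.
41
51
  - Model era: a rebaseline on 2025+ GPT/Claude/Gemini/Llama/Qwen-family generations is still a separate follow-up.
42
- - Statistical reporting: `docs/benchmarks/latest.md` publishes the fixture count, lang/class sample sizes, and Wilson 95% CIs. Bootstrap intervals and a threshold sweep are not available yet.
52
+ - Statistical reporting: `docs/benchmarks/latest.md` publishes the fixture count, lang/class sample sizes, Wilson 95% CIs, and `signal_score`-based ROC-AUC / PR-AUC / best-F1 threshold diagnostics. Bootstrap intervals are not available yet.
43
53
  - Scope: the current numbers are a regression check on the stylometry/lexicon hot decision, not live quality scores for rewrite quality, MPS, or fidelity.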
44
54
 
45
55
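  README/benchmark wording should therefore be read as “the regression passes on the current fixtures,” not as “it also detects 100% on new models and new genres.” The next rebaseline will, at a minimum, fix the sample size per language × class × register and publish confidence intervals together with a threshold sweep.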
@@ -66,7 +76,7 @@ MATTR는 길이에 민감한 TTR 문제를 완화하기 위해 moving window를
66
76
 
67
77
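  - Basis: Covington & McFall's MATTR paper points out the length dependence of plain TTR and proposes a moving-average approach.
68
78
  - patina observation: an auxiliary signal in English; in Korean, eojeol (space-delimited unit) tokenization inflates MATTR, making it a weak signal.
69
- - patina judgment: **Keep as is. Not treated as a core signal in Korean.**
79
+ - patina judgment: **Keep as is. Not treated as a primary signal in Korean.**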
70
80
 
71
81
  ### 3.3 AI-favored lexicon density
72
82
 
@@ -80,7 +90,7 @@ density = matched_ai_lexicon_entries / token_count * 1000
80
90
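  - Weakness: it can confuse legitimate domain-specific terminology with AI dressing-up words.
81
91
  - patina judgment: **Per-language/per-domain calibration is needed.** For the Korean lexicon in particular, paired-corpus mining was effective.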
82
92
 
83
- ### 3.4 Pattern-based severity score
93
+ ### 3.4 Pattern severity score
84
94
 
85
95
  patina's most important differentiator lies in “AI writing pattern editing” rather than “AI detection.” The following patterns are therefore both detector features and rewrite targets.
86
96
 
@@ -91,13 +101,13 @@ patina의 가장 중요한 차별점은 “AI detection”보다 “AI writing p
91
101
  - Unsourced appeals to authority
92
102
  - viral-hook score-only rhetoric
93
103
 
94
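- patina judgment: **Pattern-based scores should remain the central axis.** Statistical metrics are kept as auxiliary signals for the pattern scan.
104
+ patina judgment: **Pattern scores should remain the central axis.** Statistical metrics are kept as auxiliary signals for the pattern scan.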
95
105
 
96
- ### 3.5 Language-model probability-based metrics
106
+ ### 3.5 Language-model probability metrics
97
107
 
98
108
  Representative research:
99
109
 
100
- | Approach | Core idea | Applicability to patina |
110
+ | Approach | Main idea | Applicability to patina |
101
111
  |---|---|---|
102
112
  | [GLTR](https://arxiv.org/abs/1906.04043) | Visualizes “overly predictable choices” via token probability, rank, and entropy | Explainable, but requires a local LM or API logprobs |
103
113
  | [DetectGPT](https://arxiv.org/abs/2301.11305) | Hypothesis that generated text lies in a particular region of log-probability curvature | Powerful, but the perturbation/LM logprob cost is high |
@@ -118,13 +128,15 @@ patina 판단: **기본 benchmark에는 넣지 말고 optional research track으
118
128
  - sentence opener diversity
119
129
  - POS-like rough proxy: nominal sentence endings, passive expressions, nominalization suffixes
120
130
 
121
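- patina judgment: **Top candidate for the next deterministic feature.** It can be implemented without an external model and is a different axis from pattern/lexicon.
131
+ patina judgment: **Started with a conservative composite, beginning with Korean.** It can be implemented without an external model and,
132
+ being a different axis from pattern/lexicon, is promising, but a single feature such as the absence of commas is not fed into the hot decision.
133
+ `koDiagnostics.hot` is switched on only when the spacing/comma/suffix proxies all match at once.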
122
134
 
123
135
  ## 4. Cautions drawn from detection research
124
136
 
125
137
  ### 4.1 Detectors are generally weak on OOD data
126
138
 
127
- [M4](https://arxiv.org/abs/2305.14902) reports that detector generalization is difficult in multi-generator, multi-domain, multilingual environments. [RAID](https://arxiv.org/abs/2405.07940) reports that existing detectors are easily fooled by changes in sampling strategy, adversarial attacks, and unseen models.
139
+ [M4](https://arxiv.org/abs/2305.14902) reports that detector generalization is difficult in multi-generator, multi-domain, multilingual contexts. [RAID](https://arxiv.org/abs/2405.07940) reports that existing detectors are easily fooled by changes in sampling strategy, adversarial attacks, and unseen models.
128
140
 
129
141
  The patina benchmark has the same pitfall. Even if the synthetic fixtures yield 100%, it can break in the following situations.
130
142
 
@@ -136,7 +148,7 @@ patina 벤치마크도 같은 함정이 있다. synthetic fixture에서 100%가
136
148
 
137
149
  ### 4.2 Do not speak in terms of a single “AI probability”
138
150
 
139
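- OpenAI also discontinued its earlier AI classifier because of low accuracy, and stated that it was weak on short text, non-English text, OOD text, and edited AI writing. The core message of the official documentation is “do not use it as a primary decision-making tool.”
151
+ OpenAI also discontinued its earlier AI classifier because of low accuracy, and stated that it was weak on short text, non-English text, OOD text, and edited AI writing. The main message of the official documentation is “do not use it as a primary decision-making tool.”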
140
152
 
141
153
  patina, too, should express results as follows rather than as an `AI-generated probability`.
142
154
 
@@ -150,7 +162,7 @@ patina도 `AI-generated probability`가 아니라 다음처럼 표현해야 한
150
162
  - watermark: detects a signal planted in advance by a specific generation system
151
163
  - patina: reduces stylistic AI signals in the resulting text
152
164
 
153
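- Watermarking is therefore kept only as reference research and is not included as a core metric of the patina benchmark.
165
+ Watermarking is therefore kept only as reference research and is not included as a primary metric of the patina benchmark.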
154
166
 
155
167
  ## 5. Metric design proposal suited to patina
156
168
 
@@ -196,10 +208,10 @@ deterministic_ai_likeness =

  | Feature | Initial weight | Reason |
  |---|---:|---|
- | pattern severity | 0.35 | directly tied to patina's core purpose |
+ | pattern severity | 0.35 | directly tied to patina's main purpose |
  | lexicon density | 0.20 | explainable, with prior calibration experience |
- | burstiness CV | 0.20 | core deterministic signal that currently holds on the ko/en/zh/ja suspect-zone fixtures |
- | function-word divergence | 0.15 | next expansion candidate; highly topic-independent |
+ | burstiness CV | 0.20 | main deterministic signal that currently holds on the ko/en/zh/ja suspect-zone fixtures |
+ | function-word divergence | 0.15 | started as a conservative composite from the Korean suffix proxy; highly topic-independent |
  | punctuation/opening rhythm | 0.05 | lightweight auxiliary signal |
  | MATTR | 0.05 | weak in Korean, so start low |
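As an illustration of how the initial weights in this table combine, here is a minimal sketch. It assumes each feature has already been normalized to [0, 1] and that the score is a plain weighted sum; the hunk only shows the start of the `deterministic_ai_likeness` formula, so both points are assumptions, and the dictionary keys are hypothetical identifiers.

```python
# Initial weights copied from the table; the key names are hypothetical.
WEIGHTS = {
    "pattern_severity": 0.35,
    "lexicon_density": 0.20,
    "burstiness_cv": 0.20,
    "function_word_divergence": 0.15,
    "punctuation_opening_rhythm": 0.05,
    "mattr": 0.05,
}


def weighted_ai_likeness(features: dict[str, float]) -> float:
    """Weighted sum of feature scores; assumes each score is already in [0, 1]."""
    missing = WEIGHTS.keys() - features.keys()
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= features[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


# Made-up feature values for illustration only.
example = {
    "pattern_severity": 0.8,
    "lexicon_density": 0.5,
    "burstiness_cv": 0.6,
    "function_word_divergence": 0.2,
    "punctuation_opening_rhythm": 0.0,
    "mattr": 0.4,
}
print(round(weighted_ai_likeness(example), 3))
```

Because the six weights sum to 1.0, the result stays in [0, 1] whenever the inputs do.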
@@ -237,7 +249,7 @@ safe_gain = max(0, humanization_gain) * (meaning_safety / 100)

  #### Stability

- LLM-based rewrites are non-deterministic, so the variance across N runs of the same fixture is also needed.
+ LLM rewrites are non-deterministic, so the variance across N runs of the same fixture is also needed.

  ```text
  score_stability = stddev(after_ai_likeness over N runs)
@@ -306,13 +318,13 @@ notes: |
  ### Phase 1: stronger reporting, no dependencies

  - Store feature vectors in more detail in `tests/quality/results.json`
- - Add a threshold sweep and ROC-AUC/PR-AUC computation
+ - Recompute the threshold sweep and ROC-AUC/PR-AUC on a larger 2025+ corpus
  - Add per-register/per-language/per-class summaries
  - State in `tests/quality/README.md` that this is “AI-likeness, not provenance”
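The threshold sweep and ROC-AUC named in this list can be sketched in a few lines. This is a dependency-free sketch, not the project's benchmark code: the function names are hypothetical, and the scores are made-up illustration values, not results.

```python
def roc_auc(human_scores: list[float], ai_scores: list[float]) -> float:
    """ROC-AUC as P(AI score > human score), counting ties as half a win."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


def threshold_sweep(human_scores, ai_scores, thresholds):
    """Yield (threshold, catch_rate, false_positive_rate) for score >= threshold."""
    for t in thresholds:
        catch = sum(s >= t for s in ai_scores) / len(ai_scores)
        fp = sum(s >= t for s in human_scores) / len(human_scores)
        yield t, catch, fp


# Made-up scores for illustration only, not benchmark results.
human = [10.0, 20.0, 30.0, 40.0]
ai = [35.0, 50.0, 60.0, 70.0]
print(roc_auc(human, ai))
for row in threshold_sweep(human, ai, [30.0, 45.0]):
    print(row)
```

The pairwise form is quadratic in sample count, which is fine at fixture scale; a rank-based version would be preferable for a much larger corpus.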

  ### Phase 2: corpus expansion

- - Expand from 34 synthetic/curated fixtures to real-world sampled fixtures
+ - Expand from 39 synthetic/curated fixtures to real-world sampled fixtures
  - Grow the human false-positive registers first
  - Target: at least 100 human + 100 AI paragraphs per language

@@ -357,7 +369,7 @@ The direction patina should follow is not a general-purpose AI detector. A stronger
  4. **Do not claim a single AI probability.**
  5. Isolate heavy LM-probability detectors in an optional research track.

- In short, the core metrics of the patina benchmark can be summarized in the following three sentences.
+ In short, the main metrics of the patina benchmark can be summarized in the following three sentences.

  ```text
  How AI-like does it read?
@@ -0,0 +1,42 @@
+ # Blinded human evaluation panel plan
+
+ Status: study design ready; panel not run.
+ Related issue: #159.
+
+ This plan keeps human preference work separate from deterministic scoring. It should run only with texts that can be shown to reviewers and redistributed or summarized under consent.
+
+ ## Research question
+
+ Can blinded readers tell which text is the AI-like draft and which one is the Patina rewrite, and do they prefer the rewrite without noticing any loss of meaning?
+
+ ## Minimum pilot
+
+ - 30 paired samples;
+ - 5 raters per sample;
+ - language and register recorded for each pair;
+ - randomized A/B order;
+ - no model or tool names shown to raters;
+ - reviewer consent and redistribution notes stored outside public fixtures unless publishable.
+
+ ## Rater task
+
+ For each pair, ask:
+
+ 1. Which version reads more natural for the stated context?
+ 2. Did either version lose a key fact, number, name, or caveat?
+ 3. Which version would you send with light edits?
+ 4. Free-text note, optional.
+
+ ## Report shape
+
+ | metric | output |
+ |---|---|
+ | naturalness preference | Patina / original / tie counts with confidence interval |
+ | meaning concern | rate of reported fact/caveat loss |
+ | register split | results by language and register |
+ | rater agreement | kappa or raw agreement, depending on labels |
+ | exclusions | samples removed and why |
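One way to attach a confidence interval to the naturalness-preference counts is a Wilson score interval over the non-tie votes. The plan does not name an interval method, so the choice of Wilson is an assumption, and the counts below are invented for illustration. The sketch also treats all judgments as independent, which ignores clustering by rater and by sample.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 30 pairs x 5 raters = 150 judgments, ties excluded.
patina, original, tie = 80, 50, 20
low, high = wilson_interval(patina, patina + original)
print(f"Patina preferred in {patina}/{patina + original} non-tie votes, "
      f"95% CI [{low:.3f}, {high:.3f}]")
```

Reporting ties separately, as the table asks, keeps the interval from silently absorbing "no preference" votes.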
+
+ ## Privacy rule
+
+ Do not commit reviewer names, private comments, or no-redistribution source text. Public reports can include aggregate counts and short examples only when the source license and reviewer consent allow it.
@@ -0,0 +1,24 @@
+ # Cross-judge agreement plan
+
+ Status: CLI shortcut removed during surface simplification; full matrix blocked on evaluator budget.
+ Related issue: #158.
+
+ A score is more useful when the judge is not from the same model family as the suspected generator. Patina no longer exposes a per-run CLI warning for this; cross-family independence belongs in an explicit evaluation matrix rather than everyday score UX.
+
+ ## Full matrix gate
+
+ The issue stays open until a report covers:
+
+ - 3 generator families × 3 judge families × 30 samples;
+ - shared prompts and fixed sample ids;
+ - pairwise agreement table;
+ - Krippendorff alpha or Cohen/Fleiss kappa where the labels support it;
+ - a note when a judge is evaluating its own family.
+
+ ## Matrix template
+
+ | sample set | generator | judge | n | hot agree | hot disagree | agreement |
+ |---|---|---|---:|---:|---:|---:|
+ | pending | pending | pending | 0 | 0 | 0 | n/a |
+
+ Do not fill this table with synthetic numbers. Use it only after the manifest and judge outputs exist.
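For the pairwise agreement step, Cohen's kappa over two judges' binary hot/not-hot labels could be computed as below. This is a sketch under the assumption that each judge emits one boolean per fixed sample id; the function name is hypothetical. The toy labels exist only to exercise the function and must not be copied into the matrix.

```python
import math


def cohen_kappa(judge_a: list[bool], judge_b: list[bool]) -> float:
    """Cohen's kappa for two judges assigning binary hot/not-hot labels."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    rate_a = sum(judge_a) / n
    rate_b = sum(judge_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return math.nan  # both judges gave one constant label; kappa is undefined
    return (observed - expected) / (1 - expected)


# Toy labels to exercise the function; not evaluation data.
toy_a = [True, True, False, False]
toy_b = [True, False, False, False]
print(cohen_kappa(toy_a, toy_b))
```

Kappa corrects raw agreement for chance, which matters when most samples are not hot and two judges would agree often by default.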
@@ -0,0 +1,135 @@
+ # KO/2025+ Corpus Source Inventory
+
+ Verified: 2026-05-22. This inventory turns the Korean rebaseline blocker
+ (#303/#157/#155/#160) into an executable intake plan. It is not a public
+ performance claim.
+
+ ## Decision
+
+ Use a **metadata-first corpus**:
+
+ 1. keep raw text in `artifacts/rebaseline-2025/` or another private store;
+ 2. commit only redistributable examples, hashes, metadata, and aggregate reports;
+ 3. do not publish catch-rate claims until `docs/research/2025-rebaseline-plan.md`
+    reaches its n≥100 public claim gate.
+
+ That lets Patina use current Korean sources without copying restricted corpora
+ into the repository.
+
+ ## Source matrix
+
+ | source | role | evidence | repo policy | first intake |
+ |---|---|---|---|---|
+ | KatFish / KatFishNet | Korean AI-vs-human seed set; especially useful for #303 punctuation/spacing checks | ACL 2025/arXiv paper says KatFish covers human and four-LLM Korean text across three genres; the public GitHub repo contains `katfish_dataset/*.jsonl` but no repo-level license metadata is exposed | Treat raw rows as **license-review** until a license/redistribution decision is recorded. Aggregate metrics are OK; raw rows stay private. | `docs/benchmarks/katfish-ko-latest.md` reports aggregate-only calibration: Patina KO diagnostics improve KatFish catch rate by +15.9 pp with +0 public-web human-control FP rows. |
+ | 모두의 말뭉치 | human Korean controls for product-doc, news/editorial, dialogue, summary registers | NIKL lists 2024/2025 corpora and requires corpus application/approval fields | Do not commit raw text. Store local extracts privately and commit only hashes/metrics unless approval explicitly allows publication. | Apply for research/evaluation use; start with 25 human-control paragraphs after approval. |
+ | 한국어 학습자 말뭉치 | false-positive stress test for learner/second-language Korean, not a normal human baseline | Official site describes learner writing/speech corpora and notes original learner material is privacy-protected; Data.go.kr lists research-purpose distribution limits | Use as a separate FP envelope. Do not blend with native Korean controls. Commit metadata/hashes only. | Add `class: natural-human`, `register: learner-writing` only after schema/register decision, or map to `academic-summary` with reviewer note. |
+ | HAERAE-HUB/KOREAN-SyntheticText-1.5B | broad synthetic Korean AI-like pool | Hugging Face dataset page shows text parquet with 1.55M rows | Synthetic side only. Check dataset card/license before committing full text; otherwise hash-only. | Sample short paragraphs for lexicon mining candidates, then manually review before pattern changes. |
+ | Maintainer-generated 2026 prompts | controlled GPT/Claude/Gemini/open-weight model-era rows | Generated from repo-owned prompts and reproducible metadata | Preferred public seed when provider terms and prompt contents allow redistribution. Keep prompts public; keep vendor UI copies private if unsure. | Generate 5 rows each for GPT-family, Claude-family, Gemini-family, and open-weight across blog/product-doc/chat-update. |
+ | Community false-positive submissions | real Patina FP cases | GitHub false-positive issue template captures language/register/score output | Use only with explicit fixture permission. Strip account/private context by default. | Convert accepted issues into hash-only rows first; promote to fixture only after permission. |
+ | Public Korean web pages (Korea.kr, Toss Tech, Kakao/Naver/Toss docs, KISTEP/KEI/NRF, Seoul OpenGov) | natural-human Korean pilot controls for source/provenance workflow | Public pages with visible source URLs and licensing/copyright guidance where available | Commit only hash-only metadata until page-level redistribution and attribution are reviewed. Keep raw extracts in ignored `artifacts/rebaseline-2025/private/`. | `artifacts/rebaseline-2025/human-controls.public.jsonl` contains 250 scored hash-only candidate rows, with 50 rows each for blog, product-doc, academic-summary, technical-how-to, and chat-update; the missing positive (AI-like) rows still block threshold work. |
+
+ ## Intake commands
+
+ Use the local workspace scaffold:
+
+ ```bash
+ npm run benchmark:rebaseline:intake -- \
+   --input artifacts/rebaseline-2025/intake.local.jsonl \
+   --public-output artifacts/rebaseline-2025/manifest.public.jsonl \
+   --private-output artifacts/rebaseline-2025/private/generations.private.jsonl \
+   --require-source-review
+
+ node scripts/rebaseline-summary.mjs \
+   --input artifacts/rebaseline-2025/manifest.public.jsonl \
+   --json
+ ```
+
+ The intake helper computes missing `text_hash` values. If redistribution is not
+ public, it strips `text` from the public manifest and writes the full row to the
+ private output path.
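The behavior described for the intake helper can be sketched as follows. This is not the helper's code: the SHA-256 hash scheme, the `redistribution` field name, and the `id` field are assumptions; only `text`, `text_hash`, and the public/private split come from the text.

```python
import hashlib
import json


def text_hash(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 text; the real hash scheme is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_row(row: dict) -> tuple[dict, dict]:
    """Return (public_row, private_row) for one intake row."""
    private_row = dict(row)
    if "text" in private_row and not private_row.get("text_hash"):
        private_row["text_hash"] = text_hash(private_row["text"])
    public_row = dict(private_row)
    # `redistribution` is a hypothetical field name for the review outcome.
    if public_row.get("redistribution") != "public":
        public_row.pop("text", None)
    return public_row, private_row


# Hypothetical row; the Korean sample text means "This is an example paragraph."
row = {"id": "ko-001", "text": "예시 문단입니다.", "redistribution": "private"}
public_row, private_row = split_row(row)
print(json.dumps(public_row, ensure_ascii=False))
```

Hashing before stripping means a public row can still be matched against its private counterpart without exposing the text.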
+
+ Tracked starter files:
+
+ - `artifacts/rebaseline-2025/prompts.template.jsonl`: repo-owned prompt
+   anchors for Korean academic/종결-다, product-doc, blog, chat/update,
+   technical-how-to, and edited-AI rows.
+ - `artifacts/rebaseline-2025/intake.local.example.jsonl`: 25 metadata-only
+   rows matching the pilot buckets below. The hashes are placeholders; replace
+   them locally before treating the file as evidence.
+ - `artifacts/rebaseline-2025/sources.ko-public.jsonl`: tracked public-source
+   inventory for hash-only web collection.
+ - `artifacts/rebaseline-2025/human-controls.public.jsonl`: 250 web-sourced
+   Korean natural-human candidate rows, balanced at 50 rows for each tracked
+   register. It is hash-only and validates the collection path; the absent
+   AI-like cells still block public claim changes.
+
+ To refresh public-web candidates from the tracked source inventory:
+
+ ```bash
+ npm run benchmark:rebaseline:web -- \
+   --input artifacts/rebaseline-2025/sources.ko-public.jsonl \
+   --output artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
+   --target-per-register 50 \
+   --max-per-source 12 \
+   --collected-at 2026-05-22
+
+ npm run benchmark:rebaseline:score -- \
+   --input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
+   --output artifacts/rebaseline-2025/human-controls.public.jsonl \
+   --scored-at 2026-05-22
+
+ node scripts/rebaseline-summary.mjs \
+   --input artifacts/rebaseline-2025/human-controls.public.jsonl \
+   --json
+
+ npm run benchmark:katfish-ko -- --write --basename katfish-ko-latest
+ ```
+
+ To score those candidates without committing raw text:
+
+ ```bash
+ npm run benchmark:rebaseline:score -- \
+   --input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
+   --output artifacts/rebaseline-2025/human-controls.public.jsonl \
+   --scored-at 2026-05-22
+ ```
+
+ ## Remaining Korean pilot holes
+
+ The 250-row public-web pilot proves the collection and scoring path and fills the native-human register coverage gate for #157. Use the original 25-row skeleton only as a local intake template; future rows should fill the positive and comparison holes instead:
+
+ | bucket | remaining need | notes |
+ |---|---:|---|
+ | native human controls | 0 for the five tracked registers | Current tracked split is n=50 each for academic-summary, product-doc, chat-update, blog, and technical-how-to. |
+ | self-generated AI-like | n≥100 per GPT/Claude/Gemini/open-weight claim cell | Keep prompt ids, model ids, decoding, and provider terms notes; KatFish aggregate closes only the KO diagnostic calibration, not public claim coverage. |
+ | lightly/heavily edited AI | at least one light and one heavy edit per target register | Preserve before/after hashes and edit policy. |
+ | KatFish aggregate comparison | complete for #303 | Raw rows stay private; tracked output is aggregate-only. |
+ | FP submissions / learner stress | separate reviewed envelope | Do not blend learner Korean into the native-human baseline. |
+
+ Exit criteria for the pilot:
+
+ - `npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.jsonl --dry-run --require-source-review`
+   passes before any rows are scored;
+ - `node scripts/rebaseline-summary.mjs --input <manifest>` validates with no errors;
+ - every row has `source_review` or `reviewer_notes` explaining redistribution
+   status when raw text is absent;
+ - no threshold or README catch-rate claim changes are made from the pilot alone;
+ - findings are posted back to #303/#157/#155 before #160 lexicon mining starts.
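The third exit criterion can be checked mechanically. The sketch below assumes the manifest has been loaded as a list of dictionaries and that rows carry an `id` field; both are assumptions, and the real validation lives in `scripts/rebaseline-summary.mjs`, whose logic is not shown here.

```python
def rows_missing_review(rows: list[dict]) -> list[str]:
    """Return ids of rows that lack raw text and any redistribution explanation."""
    failures = []
    for row in rows:
        has_text = bool(row.get("text"))
        explained = bool(row.get("source_review") or row.get("reviewer_notes"))
        if not has_text and not explained:
            failures.append(str(row.get("id", "<no id>")))
    return failures


# Hypothetical manifest rows for illustration.
manifest = [
    {"id": "ko-001", "text_hash": "abc", "source_review": "hash-only, page license pending"},
    {"id": "ko-002", "text_hash": "def"},
    {"id": "ko-003", "text": "public sample", "text_hash": "ghi"},
]
print(rows_missing_review(manifest))
```

A non-empty result would mean the pilot has not met its exit criteria, regardless of how the scores look.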
+
+ ## Source links
+
+ - KatFishNet paper: <https://arxiv.org/abs/2503.00032>
+ - KatFishNet repository: <https://github.com/Shinwoo-Park/katfishnet>
+ - 모두의 말뭉치 request page: <https://kli.korean.go.kr/corpus/main/requestMain.do>
+ - 모두의 말뭉치 introduction: <https://kli.korean.go.kr/m/introduce/corpusIntroduce.do>
+ - 한국어 학습자 말뭉치: <https://kcorpus.korean.go.kr/index/goIntroduceSite.do>
+ - 한국어 학습자 말뭉치 Data.go.kr metadata: <https://www.data.go.kr/data/15094033/fileData.do>
+ - HAERAE Korean SyntheticText: <https://huggingface.co/datasets/HAERAE-HUB/KOREAN-SyntheticText-1.5B>
+ - KOGL introduction: <https://www.kogl.or.kr/info/introduce.do>
+ - KOGL license guide: <https://www.kogl.or.kr/info/license.do>
+ - Korea.kr policy article: <https://www.korea.kr/news/policyNewsView.do?newsId=148959377>
+ - MCST KOGL type guide: <https://www.mcst.go.kr/site/s_open/kogl/koglType.jsp>
+ - MCST copyright Q&A: <https://www.mcst.go.kr/site/s_policy/copyright/question/question17.jsp>
+ - Seoul OpenGov copyright policy: <https://opengov.seoul.go.kr/copyright>
+ - Tracked public-web source inventory:
+   `artifacts/rebaseline-2025/sources.ko-public.jsonl`