patina-cli 3.11.0 → 4.0.0
- package/.patina.default.yaml +29 -29
- package/CHANGELOG.md +53 -0
- package/NOTICE +21 -0
- package/README.md +117 -224
- package/README_JA.md +134 -77
- package/README_KR.md +132 -74
- package/README_ZH.md +137 -80
- package/SKILL.md +11 -20
- package/artifacts/rebaseline-2025/README.md +147 -0
- package/artifacts/rebaseline-2025/human-controls.public.jsonl +250 -0
- package/artifacts/rebaseline-2025/intake.example.jsonl +2 -0
- package/artifacts/rebaseline-2025/intake.local.example.jsonl +25 -0
- package/artifacts/rebaseline-2025/prompts.template.jsonl +7 -0
- package/artifacts/rebaseline-2025/sources.ko-public.jsonl +39 -0
- package/assets/brand/patina-badge.svg +18 -0
- package/assets/brand/patina-mark.svg +8 -0
- package/assets/demo/README.md +79 -0
- package/core/scoring.md +12 -12
- package/core/standalone-prompt.md +3 -1
- package/core/stylometry.md +93 -22
- package/docs/API.md +1554 -0
- package/docs/AUTHENTICATION.md +50 -26
- package/docs/AUTHENTICATION_KR.md +54 -29
- package/docs/BRANDING.md +9 -8
- package/docs/CLI.md +55 -14
- package/docs/COOKBOOK.md +8 -21
- package/docs/DEMO.md +32 -5
- package/docs/EXIT-CODES.md +2 -3
- package/docs/FALSE-POSITIVES.md +63 -0
- package/docs/FAQ.md +9 -1
- package/docs/FAQ_KR.md +3 -1
- package/docs/FLAG-PARITY.md +33 -47
- package/docs/ISSUE-WAVES.md +57 -0
- package/docs/PATTERNS-EN.md +67 -3
- package/docs/PATTERNS-JA.md +68 -2
- package/docs/PATTERNS-KO.md +70 -7
- package/docs/PATTERNS-ZH.md +67 -3
- package/docs/PATTERNS.md +5 -5
- package/docs/RESEARCH-DOCS-PLATFORM.md +54 -0
- package/docs/ROADMAP.md +46 -66
- package/docs/TRANSLATIONESE-KO.md +51 -0
- package/docs/audits/2026-05-deep-research.md +3 -1
- package/docs/benchmarks/README.md +51 -0
- package/docs/benchmarks/detector-comparison.json +69 -9
- package/docs/benchmarks/detector-comparison.md +10 -5
- package/docs/benchmarks/katfish-ko-latest.json +657 -0
- package/docs/benchmarks/katfish-ko-latest.md +77 -0
- package/docs/benchmarks/latest.json +1183 -108
- package/docs/benchmarks/latest.md +84 -60
- package/docs/benchmarks/lexicon-freshness-en-2026-05-22.json +1121 -0
- package/docs/benchmarks/lexicon-freshness-en-2026-05-22.md +136 -0
- package/docs/benchmarks/rebaseline-latest.json +381 -0
- package/docs/benchmarks/rebaseline-latest.md +121 -0
- package/docs/benchmarks/register-stratified-latest.json +164 -0
- package/docs/benchmarks/register-stratified-latest.md +99 -0
- package/docs/benchmarks/register-stratified.md +43 -0
- package/docs/integrations/github-action.md +44 -11
- package/docs/integrations/playground.md +58 -0
- package/docs/integrations/pre-commit.md +5 -5
- package/docs/integrations/release.md +5 -3
- package/docs/integrations/static-sites.md +83 -0
- package/docs/research/2025-rebaseline-plan.md +71 -2
- package/docs/research/2026-rebaseline.md +102 -0
- package/docs/research/adversarial-mps.md +41 -0
- package/docs/research/ai-human-metrics.md +35 -23
- package/docs/research/human-eval-panel.md +42 -0
- package/docs/research/judge-agreement.md +24 -0
- package/docs/research/ko-2025-corpus-sources.md +135 -0
- package/docs/research/lexicon-freshness-audit.md +64 -0
- package/docs/research/zh-ja-lexicon-calibration.md +60 -0
- package/docs/social/patina-launch-copy.md +173 -100
- package/docs/social/patina-launch-execution.md +94 -0
- package/docs/social/patina-launch-korean-first.md +83 -0
- package/docs/social/signs-of-ai-writing.md +26 -0
- package/docs/social/signs-of-ai-writing_KR.md +26 -0
- package/lexicon/ai-en.md +21 -24
- package/lexicon/ai-ja.md +158 -0
- package/lexicon/ai-ko.md +9 -9
- package/lexicon/ai-zh.md +158 -0
- package/lexicon/provenance/ai-en.json +970 -0
- package/lexicon/provenance/ai-ja.json +542 -0
- package/lexicon/provenance/ai-ko.json +866 -0
- package/lexicon/provenance/ai-zh.json +542 -0
- package/package.json +49 -8
- package/patterns/en-communication.md +5 -0
- package/patterns/en-content.md +5 -0
- package/patterns/en-filler.md +5 -0
- package/patterns/en-language.md +29 -1
- package/patterns/en-structure.md +5 -0
- package/patterns/en-style.md +5 -0
- package/patterns/en-viral-hook.md +42 -2
- package/patterns/ja-communication.md +5 -0
- package/patterns/ja-content.md +5 -0
- package/patterns/ja-filler.md +5 -0
- package/patterns/ja-language.md +33 -1
- package/patterns/ja-structure.md +12 -0
- package/patterns/ja-style.md +5 -0
- package/patterns/ja-viral-hook.md +41 -2
- package/patterns/ko-communication.md +5 -0
- package/patterns/ko-content.md +5 -0
- package/patterns/ko-filler.md +5 -0
- package/patterns/ko-language.md +33 -1
- package/patterns/ko-structure.md +25 -6
- package/patterns/ko-style.md +5 -0
- package/patterns/ko-viral-hook.md +38 -2
- package/patterns/zh-communication.md +5 -0
- package/patterns/zh-content.md +5 -0
- package/patterns/zh-filler.md +5 -0
- package/patterns/zh-language.md +37 -1
- package/patterns/zh-structure.md +12 -0
- package/patterns/zh-style.md +5 -0
- package/patterns/zh-viral-hook.md +38 -2
- package/playground/README.md +55 -0
- package/playground/analytics.js +4 -0
- package/playground/analyzer.js +883 -0
- package/playground/app.js +157 -0
- package/playground/data/lexicons.js +343 -0
- package/playground/index.html +138 -0
- package/playground/styles.css +267 -0
- package/profiles/namuwiki.md +111 -0
- package/scripts/adversarial-mps-report.mjs +201 -0
- package/scripts/badge-json.mjs +79 -0
- package/scripts/benchmark-report.mjs +56 -9
- package/scripts/check-release-metadata.mjs +0 -2
- package/scripts/detector-comparison.mjs +7 -7
- package/scripts/generate-playground-data.mjs +77 -0
- package/scripts/katfish-calibration.mjs +464 -0
- package/scripts/lexicon-freshness.mjs +485 -0
- package/scripts/lint.mjs +1 -1
- package/scripts/precommit-score.mjs +4 -3
- package/scripts/prose-score.mjs +81 -5
- package/scripts/rebaseline-intake.mjs +242 -0
- package/scripts/rebaseline-score.mjs +268 -0
- package/scripts/rebaseline-summary.mjs +773 -0
- package/scripts/rebaseline-web-collect.mjs +410 -0
- package/scripts/update-benchmark-ranges.mjs +1 -0
- package/src/api.js +69 -105
- package/src/auth.js +50 -2
- package/src/backends/claude-cli.js +19 -4
- package/src/backends/codex-cli.js +19 -3
- package/src/backends/contract.js +230 -1
- package/src/backends/gemini-cli.js +18 -5
- package/src/backends/index.js +87 -12
- package/src/backends/kimi-cli.js +161 -0
- package/src/cli.js +577 -567
- package/src/commands/doctor.js +2 -2
- package/src/config.js +29 -0
- package/src/errors.js +53 -1
- package/src/features/discourse-tells.js +68 -0
- package/src/features/index.js +82 -8
- package/src/features/lexicon.js +40 -6
- package/src/features/markup-leakage.js +69 -0
- package/src/features/segment.js +41 -0
- package/src/features/signal-strength.js +81 -0
- package/src/features/stylometry.js +231 -1
- package/src/features/translationese.js +127 -0
- package/src/loader.js +76 -0
- package/src/logger.js +22 -23
- package/src/model-defaults.js +55 -0
- package/src/ouroboros.js +31 -0
- package/src/output.js +102 -90
- package/src/prompt-builder.js +103 -68
- package/src/providers.js +51 -4
- package/src/scoring.js +210 -2
- package/src/security.js +75 -0
- package/tests/fixtures/live-quality/en/public-docs-01.md +26 -0
- package/tests/fixtures/live-quality/ko/public-docs-01.md +26 -0
- package/tests/fixtures/suspect-zones/expected-ranges.json +207 -16
- package/tests/fixtures/suspect-zones/ja/ai/ja-ai-04-lexicon.md +11 -0
- package/tests/fixtures/suspect-zones/ja/natural/ja-nat-04-lexicon-cold.md +11 -0
- package/tests/fixtures/suspect-zones/ko/ai/ko-ai-02.md +4 -5
- package/tests/fixtures/suspect-zones/ko/ai/ko-ai-07-ko-diagnostic.md +11 -0
- package/tests/fixtures/suspect-zones/zh/ai/zh-ai-04-lexicon.md +11 -0
- package/tests/fixtures/suspect-zones/zh/natural/zh-nat-04-lexicon-cold.md +11 -0
- package/tests/quality/README.md +188 -11
- package/tests/quality/adversarial-mps/fixtures.jsonl +10 -0
- package/tests/quality/benchmark.mjs +39 -1
- package/tests/quality/dogfood.mjs +5 -3
- package/tests/quality/live-fixtures.jsonl +2 -0
- package/tests/quality/live-quality.mjs +596 -0
- package/tests/quality/ranking-metrics.mjs +136 -0
- package/tests/quality/rebaseline-manifest.example.jsonl +5 -0
- package/vercel.json +53 -0
- package/SKILL-MAX.md +0 -455
- package/docs/internal/HARNESS.md +0 -14
- package/docs/internal/README.md +0 -14
- package/docs/internal/WARP.md +0 -23
- package/patina-max/SKILL.md +0 -523
- package/patina-max/composite.py +0 -457
- package/src/cache.js +0 -106
- package/src/commands/init.js +0 -208
- package/src/manifest.js +0 -162
- package/src/max-mode.js +0 -207
|
@@ -2,7 +2,8 @@
|
|
|
2
2
|
|
|
3
3
|
Status: protocol template, no external detector/vendor claims yet.
|
|
4
4
|
Owner: maintainers.
|
|
5
|
-
Related issues: #155, #
|
|
5
|
+
Related issues: #155, #157, #160, #303, plus #158/#159 for evaluator follow-up. The #156 adversarial MPS gate is now in-tree.
|
|
6
|
+
Korean source inventory: `docs/research/ko-2025-corpus-sources.md`.
|
|
6
7
|
|
|
7
8
|
Patina's checked-in benchmark is a deterministic regression corpus. It is useful
|
|
8
9
|
for catching tokenizer, threshold, lexicon, and fixture drift. It is not enough
|
|
@@ -24,12 +25,29 @@ Minimum matrix:
|
|
|
24
25
|
| generators | at least one GPT-family, Claude-family, Gemini-family, and open-weight model |
|
|
25
26
|
| sample size | start at 25 paragraphs per language × class × register cell before publishing claims |
|
|
26
27
|
|
|
28
|
+
The 25-paragraph matrix is the first intake target for checking coverage holes.
|
|
29
|
+
It is not the public benchmark gate. The stricter claim gate remains the
|
|
30
|
+
`process/pattern-freshness.md` requirement: at least three generator families
|
|
31
|
+
across at least two languages with n≥100 per claim cell and binomial 95%
|
|
32
|
+
confidence intervals.
|
|
33
|
+
|
|
34
|
+
|
|
35
|
+
## Execution order
|
|
36
|
+
|
|
37
|
+
1. Start with Korean calibration (#303/#157): collect natural Korean controls for academic/종결-다, blog, product-doc, and community registers before changing KO thresholds again.
|
|
38
|
+
2. Then run the model-era rebaseline (#155): score the fixed manifest across at least three generator families and at least two languages.
|
|
39
|
+
3. Keep lexicon freshness separate from public catch-rate claims: English #160 remine now has per-entry provenance and ≥4× hot/cold lift evidence; rerun it during quarterly reviews, and do the same for KO/ZH/JA when paired corpora exist.
|
|
40
|
+
4. Use the same manifest for cross-judge and blinded-panel follow-ups (#158/#159) instead of creating separate incompatible samples; keep the in-tree adversarial MPS gate current as a companion check.
|
|
41
|
+
|
|
27
42
|
## Data rules
|
|
28
43
|
|
|
29
44
|
- Use redistributable prompts and generated text; do not check private user text into the repo.
|
|
30
45
|
- Store full text only when redistribution is allowed. Otherwise store hashes, metadata, and metrics.
|
|
31
46
|
- Keep generation metadata: model, provider, date, prompt id, decoding params, language, register, and any editing pass.
|
|
32
47
|
- Separate detector-facing evaluation from rewrite-quality evaluation.
|
|
48
|
+
- For Korean 2025+ rows, start from `docs/research/ko-2025-corpus-sources.md`
|
|
49
|
+
and run the local intake helper before any public report:
|
|
50
|
+
`npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.jsonl --require-source-review`.
|
|
33
51
|
|
|
34
52
|
## Metrics
|
|
35
53
|
|
|
@@ -37,7 +55,7 @@ For deterministic suspect-zone detection:
|
|
|
37
55
|
|
|
38
56
|
- accuracy, precision, recall, F1
|
|
39
57
|
- Wilson 95% confidence interval per language and register
|
|
40
|
-
- detector sub-signal breakdown: burstiness, MATTR, lexicon
|
|
58
|
+
- detector sub-signal breakdown: burstiness, MATTR, lexicon, koDiagnostics
|
|
41
59
|
- expected metric ranges for checked-in fixtures
|
|
42
60
|
|
|
43
61
|
For rewrite quality:
|
|
@@ -60,6 +78,11 @@ Recommended local/private layout before publishing sanitized summaries:
|
|
|
60
78
|
|
|
61
79
|
```text
|
|
62
80
|
artifacts/rebaseline-2025/
|
|
81
|
+
├── README.md # tracked scaffold and privacy rules
|
|
82
|
+
├── intake.example.jsonl # tracked smoke fixture, not evidence
|
|
83
|
+
├── intake.local.jsonl # gitignored local working input
|
|
84
|
+
├── manifest.public.jsonl # gitignored sanitized output for local review
|
|
85
|
+
├── human-controls.public.jsonl # tracked hash-only KO web candidate manifest
|
|
63
86
|
├── prompts.jsonl
|
|
64
87
|
├── generations.jsonl # only redistributable text
|
|
65
88
|
├── generations.private.jsonl # gitignored/private text if needed
|
|
@@ -71,6 +94,52 @@ artifacts/rebaseline-2025/
|
|
|
71
94
|
|
|
72
95
|
Checked-in public summaries should live under `docs/benchmarks/` after review.
|
|
73
96
|
|
|
97
|
+
## Manifest scaffold
|
|
98
|
+
|
|
99
|
+
Use the checked-in example manifest to validate the metadata contract before
|
|
100
|
+
collecting a larger corpus:
|
|
101
|
+
|
|
102
|
+
```bash
|
|
103
|
+
npm run benchmark:rebaseline
|
|
104
|
+
npm run benchmark:rebaseline:report
|
|
105
|
+
node scripts/rebaseline-summary.mjs --input tests/quality/rebaseline-manifest.example.jsonl --json
|
|
106
|
+
npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.example.jsonl --dry-run
|
|
107
|
+
npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.example.jsonl --dry-run --require-source-review
|
|
108
|
+
npm run benchmark:rebaseline:web -- --target-per-register 50 --max-per-source 12 --collected-at 2026-05-22
|
|
109
|
+
npm run benchmark:rebaseline:score -- --input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl --output artifacts/rebaseline-2025/human-controls.public.jsonl --scored-at 2026-05-22
|
|
110
|
+
node scripts/rebaseline-summary.mjs --input artifacts/rebaseline-2025/human-controls.public.jsonl --json
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
The manifest row schema is intentionally metadata-first:
|
|
114
|
+
|
|
115
|
+
- required: `sample_id`, `language`, `class`, `register`, `model_family`,
|
|
116
|
+
`provider`, `model`, `generated_at`, `prompt_id`, `decoding`, `postprocess`,
|
|
117
|
+
`redistribution`, `text_hash`
|
|
118
|
+
- optional: `text` only when redistribution permits it, `patina_score`,
|
|
119
|
+
`expected_hot`, `predicted_hot`, `source_url`, `source_license`,
|
|
120
|
+
`source_review`, `reviewer_notes`
|
|
121
|
+
|
|
122
|
+
`text_hash` uses the same `sha256:<hex>` style as runtime manifests. If a row
|
|
123
|
+
contains `text`, the validator checks that the digest matches. If a row is
|
|
124
|
+
`metadata-only`, `private`, or `no-redistribution`, full text must stay out of
|
|
125
|
+
the repository.
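A minimal JavaScript sketch of the row rules above. It is not the repository's validator: `validateRow` is a hypothetical name, the `redistribution` field is assumed to carry the labels `metadata-only`, `private`, or `no-redistribution`, and the digest is assumed to be 64 lowercase hex characters. Recomputing the digest from `text` is left out.

```javascript
// Hypothetical sketch of the metadata-first row contract; names are illustrative.
const REQUIRED_FIELDS = [
  'sample_id', 'language', 'class', 'register', 'model_family',
  'provider', 'model', 'generated_at', 'prompt_id', 'decoding',
  'postprocess', 'redistribution', 'text_hash',
];

// Assumption: these labels are values of the `redistribution` field.
const TEXT_FORBIDDEN = new Set(['metadata-only', 'private', 'no-redistribution']);

// Assumption: a SHA-256 digest written as 64 lowercase hex characters.
const HASH_PATTERN = /^sha256:[0-9a-f]{64}$/;

function validateRow(row) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    const value = row[field];
    if (value === undefined || value === null || value === '') {
      errors.push(`missing required field: ${field}`);
    }
  }
  if (typeof row.text_hash === 'string' && !HASH_PATTERN.test(row.text_hash)) {
    errors.push('text_hash must look like sha256:<hex>');
  }
  if (TEXT_FORBIDDEN.has(row.redistribution) && typeof row.text === 'string') {
    errors.push(`full text is not allowed when redistribution is ${row.redistribution}`);
  }
  return errors;
}

// Usage: a hash-only row passes; adding raw text to it is rejected.
const exampleRow = {
  sample_id: 'example-001', language: 'ko', class: 'natural', register: 'blog',
  model_family: 'human', provider: 'none', model: 'none',
  generated_at: '2026-05-22', prompt_id: 'none', decoding: 'none',
  postprocess: 'none', redistribution: 'metadata-only',
  text_hash: 'sha256:' + '0'.repeat(64),
};
console.log(validateRow(exampleRow));
console.log(validateRow({ ...exampleRow, text: 'raw text' }));
```

Returning a list of errors instead of throwing lets a caller report every problem in a row at once, which suits a dry-run intake pass.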
|
|
126
|
+
|
|
127
|
+
`npm run benchmark:rebaseline:report` writes the sanitized summary to
|
|
128
|
+
`docs/benchmarks/rebaseline-latest.md` and `.json`. The default report now uses
|
|
129
|
+
the #155 claim-ready 2026 manifest; `tests/quality/rebaseline-manifest.example.jsonl`
|
|
130
|
+
remains the small `BLOCKED` fixture for proving the gate fails incomplete
|
|
131
|
+
evidence.
|
|
132
|
+
|
|
133
|
+
The tracked `artifacts/rebaseline-2025/human-controls.public.jsonl` file is a
|
|
134
|
+
250-row Korean web candidate manifest for validating the provenance and hash-only
|
|
135
|
+
path, balanced at 50 rows per tracked register. It contains no raw text;
|
|
136
|
+
score/outcome fields are deterministic false-positive evidence only and must not
|
|
137
|
+
be used as a threshold or README performance claim by themselves.
|
|
138
|
+
|
|
139
|
+
Use the false-positive form for person-written samples that should feed the
|
|
140
|
+
human/natural side of future calibration:
|
|
141
|
+
<https://github.com/devswha/patina/issues/new?template=false_positive.yml>.
|
|
142
|
+
|
|
74
143
|
## Publication gate
|
|
75
144
|
|
|
76
145
|
Do not publish competitive claims until the report includes:
|
|
@@ -0,0 +1,102 @@
|
|
|
1
|
+
# 2026 modern-model rebaseline
|
|
2
|
+
|
|
3
|
+
Status: claim-ready for the checked-in deterministic benchmark surface as of 2026-05-22.
|
|
4
|
+
Related issues: #155, #157, #160, #303.
|
|
5
|
+
|
|
6
|
+
This page records the first #155-compatible rebaseline after the HC3-era claim was retired. It does not replace the broader protocol in [`2025-rebaseline-plan.md`](2025-rebaseline-plan.md); it is the current public claim surface for Korean and English only.
|
|
7
|
+
|
|
8
|
+
## Inputs
|
|
9
|
+
|
|
10
|
+
| input | rows | public text | notes |
|
|
11
|
+
|---|---:|---:|---|
|
|
12
|
+
| `artifacts/rebaseline-2025/private/modern-generations.private.jsonl` | 600 | 0 | Locally generated raw text from logged-in CLI surfaces; kept ignored under `private/`. |
|
|
13
|
+
| `artifacts/rebaseline-2025/human-controls.public.jsonl` | 100 selected from 250 | 0 | Korean public-web human controls, hash-only and already scored. |
|
|
14
|
+
| `artifacts/rebaseline-2025/private/hape-en.private.jsonl` | 100 selected from HAP-E human rows | 0 | English HAP-E human controls, transformed to hash-only public rows. |
|
|
15
|
+
| `artifacts/rebaseline-2025/rebaseline-2026.scored.public.jsonl` | 800 | 0 | Public scored manifest: metadata, hashes, deterministic scores, and outcomes only. |
|
|
16
|
+
|
|
17
|
+
Modern-model positive cells are balanced at n=100 for each language × family:
|
|
18
|
+
|
|
19
|
+
| language | model family | local surface/model |
|
|
20
|
+
|---|---|---|
|
|
21
|
+
| en | gpt-family | `codex-cli` / `gpt-5.5` |
|
|
22
|
+
| en | claude-family | `claude-cli` / `claude-sonnet-4-6` |
|
|
23
|
+
| en | gemini-family | `gemini-cli` / `gemini-2.5-pro` |
|
|
24
|
+
| ko | gpt-family | `codex-cli` / `gpt-5.5` |
|
|
25
|
+
| ko | claude-family | `claude-cli` / `claude-sonnet-4-6` |
|
|
26
|
+
| ko | gemini-family | `gemini-cli` / `gemini-2.5-pro` |
|
|
27
|
+
|
|
28
|
+
The model rows are ordinary assistant completions requested through CLI tools, not API-temperature-controlled lab completions. Treat the result as a deterministic Patina detector rebaseline for these local surfaces, not as an authorship study.
|
|
29
|
+
|
|
30
|
+
## Claim gate
|
|
31
|
+
|
|
32
|
+
`npm run benchmark:rebaseline:claim-manifest -- --scored-at 2026-05-22` writes the public-safe manifest, then:
|
|
33
|
+
|
|
34
|
+
```bash
|
|
35
|
+
node scripts/rebaseline-summary.mjs \
|
|
36
|
+
--input artifacts/rebaseline-2025/rebaseline-2026.scored.public.jsonl \
|
|
37
|
+
--write \
|
|
38
|
+
--basename rebaseline-latest \
|
|
39
|
+
--require-claim-ready
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
Latest report: [`docs/benchmarks/rebaseline-latest.md`](../benchmarks/rebaseline-latest.md).
|
|
43
|
+
|
|
44
|
+
| gate | result |
|
|
45
|
+
|---|---:|
|
|
46
|
+
| generator families with n≥100 positive cells | 3 |
|
|
47
|
+
| languages with n≥100 positive cells | 2 |
|
|
48
|
+
| qualified positive language × family cells | 6 |
|
|
49
|
+
| natural/human languages with n≥100 | 2 |
|
|
50
|
+
| complete expected/predicted outcome rows | 800 |
|
|
51
|
+
| public raw text committed | 0 |
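The positive-cell part of this gate can be sketched as a count over language × family cells. This is an illustrative sketch only: `checkClaimGate` and the cell field names are hypothetical, and the human-control and outcome-completeness rows of the table are not modeled.

```javascript
// Hypothetical sketch: count cells that meet the n>=100 bar, then require
// at least three generator families across at least two languages.
function checkClaimGate(cells, { minPerCell = 100, minFamilies = 3, minLanguages = 2 } = {}) {
  const qualified = cells.filter((cell) => cell.n >= minPerCell);
  const families = new Set(qualified.map((cell) => cell.modelFamily));
  const languages = new Set(qualified.map((cell) => cell.language));
  return {
    qualifiedCells: qualified.length,
    families: families.size,
    languages: languages.size,
    claimReady: families.size >= minFamilies && languages.size >= minLanguages,
  };
}

// Cell sizes taken from the tables in this report.
const positiveCells = [
  { language: 'en', modelFamily: 'gpt-family', n: 100 },
  { language: 'en', modelFamily: 'claude-family', n: 100 },
  { language: 'en', modelFamily: 'gemini-family', n: 100 },
  { language: 'ko', modelFamily: 'gpt-family', n: 100 },
  { language: 'ko', modelFamily: 'claude-family', n: 100 },
  { language: 'ko', modelFamily: 'gemini-family', n: 100 },
];
console.log(checkClaimGate(positiveCells));
// -> { qualifiedCells: 6, families: 3, languages: 2, claimReady: true }
```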
|
|
52
|
+
|
|
53
|
+
## Headline metrics
|
|
54
|
+
|
|
55
|
+
All intervals below are Wilson 95% confidence intervals.
|
|
56
|
+
|
|
57
|
+
| metric | value |
|
|
58
|
+
|---|---:|
|
|
59
|
+
| Overall AI catch rate | 67.3% [63.5–71.0%] |
|
|
60
|
+
| Overall human-control false-positive rate | 16.0% [11.6–21.7%] |
|
|
61
|
+
| Precision | 92.7% |
|
|
62
|
+
| F1 | 0.780 |
|
|
63
|
+
| TP/FP/FN/TN | 404/32/196/168 |
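The headline rates follow directly from the TP/FP/FN/TN counts. A short JavaScript sketch of that arithmetic, using the standard definitions (the function name is hypothetical and zero denominators are not handled):

```javascript
// Standard confusion-matrix metrics; `confusionMetrics` is an illustrative name.
function confusionMetrics({ tp, fp, fn, tn }) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn); // the "AI catch rate" in this report
  const falsePositiveRate = fp / (fp + tn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, falsePositiveRate, f1 };
}

const headline = confusionMetrics({ tp: 404, fp: 32, fn: 196, tn: 168 });
console.log(headline.recall.toFixed(3));            // 0.673
console.log(headline.falsePositiveRate.toFixed(3)); // 0.160
console.log(headline.precision.toFixed(3));         // 0.927
console.log(headline.f1.toFixed(3));                // 0.780
```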
|
|
64
|
+
|
|
65
|
+
## Catch rate by language × model family
|
|
66
|
+
|
|
67
|
+
| language | model family | n | catch rate | 95% CI | caught/missed |
|
|
68
|
+
|---|---|---:|---:|---:|---:|
|
|
69
|
+
| en | claude-family | 100 | 74.0% | 64.6%–81.6% | 74/26 |
|
|
70
|
+
| en | gemini-family | 100 | 79.0% | 70.0%–85.8% | 79/21 |
|
|
71
|
+
| en | gpt-family | 100 | 77.0% | 67.8%–84.2% | 77/23 |
|
|
72
|
+
| ko | claude-family | 100 | 68.0% | 58.3%–76.3% | 68/32 |
|
|
73
|
+
| ko | gemini-family | 100 | 62.0% | 52.2%–70.9% | 62/38 |
|
|
74
|
+
| ko | gpt-family | 100 | 44.0% | 34.7%–53.8% | 44/56 |
|
|
75
|
+
|
|
76
|
+
## False-positive rate by language
|
|
77
|
+
|
|
78
|
+
| language | n | FP rate | 95% CI | FP/TN |
|
|
79
|
+
|---|---:|---:|---:|---:|
|
|
80
|
+
| en | 100 | 14.0% | 8.5%–22.1% | 14/86 |
|
|
81
|
+
| ko | 100 | 18.0% | 11.7%–26.7% | 18/82 |
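The intervals in these tables can be reproduced with the Wilson score formula. A minimal sketch, assuming z = 1.96 for the 95% level (the repository's exact constant is not shown here, and `wilsonInterval` is a hypothetical name):

```javascript
// Wilson score interval for a binomial proportion.
function wilsonInterval(successes, n, z = 1.96) {
  if (n <= 0) throw new RangeError('n must be positive');
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: center - half, upper: center + half };
}

// English false positives: 14 of 100 human controls flagged.
const enFp = wilsonInterval(14, 100);
console.log((enFp.lower * 100).toFixed(1), (enFp.upper * 100).toFixed(1)); // 8.5 22.1

// Korean false positives: 18 of 100.
const koFp = wilsonInterval(18, 100);
console.log((koFp.lower * 100).toFixed(1), (koFp.upper * 100).toFixed(1)); // 11.7 26.7
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at n=100 cell sizes, which is presumably why the report uses it.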
|
|
82
|
+
|
|
83
|
+
## Interpretation
|
|
84
|
+
|
|
85
|
+
- The old README headline, “91% Korean / 76% English,” is no longer the current claim. It came from smaller HC3-era/paired fixtures and overstated current Korean catch behavior.
|
|
86
|
+
- English remains around the previous HC3 headline on these modern CLI samples (74–79% per family), but Korean is materially lower, especially the GPT-family cell at 44%.
|
|
87
|
+
- Human-control false positives remain inside the existing ≤25% acceptance envelope overall, but register-level review is still needed before any threshold tightening.
|
|
88
|
+
- This is an editing-hotspot detector benchmark. It must not be presented as a reliable author-attribution or detector-bypass guarantee.
|
|
89
|
+
|
|
90
|
+
## Reproducibility notes
|
|
91
|
+
|
|
92
|
+
- Raw generated and HAP-E text stays in ignored `artifacts/rebaseline-2025/private/` files. The committed manifest stores `sha256` digests, metadata, Patina scores, and outcome labels only.
|
|
93
|
+
- The generation helper is `scripts/rebaseline-generate-modern.mjs`; the public-safe combiner is `scripts/rebaseline-build-claim-manifest.mjs`.
|
|
94
|
+
- HAP-E is used only for English human controls in this report. The previous #160 HAP-E lexicon-mining note still applies: HAP-E is useful, but it is not a substitute for Korean or 2026 model positives.
|
|
95
|
+
- `docs/benchmarks/rebaseline-latest.json` is the machine-readable companion to the Markdown report.
|
|
96
|
+
|
|
97
|
+
## Remaining research work
|
|
98
|
+
|
|
99
|
+
1. Add lightly/heavily edited-AI rows so rewrite behavior is tested separately from raw assistant completions.
|
|
100
|
+
2. Add non-KO/EN languages once public controls and generated positives reach the same n≥100 cell gate.
|
|
101
|
+
3. Repeat the generation on API surfaces with explicit decoding parameters if a stricter lab-style claim is needed.
|
|
102
|
+
4. Review Korean GPT-family misses before changing thresholds; the current result argues for targeted KO diagnostics rather than global score inflation.
|
|
@@ -0,0 +1,41 @@
|
|
|
1
|
+
# Adversarial MPS audit
|
|
2
|
+
|
|
3
|
+
This report checks whether a rewrite can preserve explicit meaning anchors while still looking AI-like. It is a repo-owned adversarial fixture set, not a public model-performance claim.
|
|
4
|
+
|
|
5
|
+
Fixture source: `tests/quality/adversarial-mps/fixtures.jsonl`
|
|
6
|
+
|
|
7
|
+
## Summary
|
|
8
|
+
|
|
9
|
+
- Fixtures: 10
|
|
10
|
+
- Passing adversarial cases: 10/10
|
|
11
|
+
- Minimum anchor-MPS proxy: 100.0
|
|
12
|
+
- Minimum deterministic AI score: 100.0
|
|
13
|
+
- Gate: MPS proxy ≥90 and deterministic AI score ≥60.
|
|
14
|
+
|
|
15
|
+
## Results
|
|
16
|
+
|
|
17
|
+
| id | lang | register | MPS proxy | AI score | hot paragraphs | status |
|
|
18
|
+
|---|---|---|---:|---:|---:|---|
|
|
19
|
+
| adv-mps-ko-01 | ko | marketing | 100.0 | 100.0 | 1/1 | pass |
|
|
20
|
+
| adv-mps-ko-02 | ko | technical | 100.0 | 100.0 | 1/1 | pass |
|
|
21
|
+
| adv-mps-ko-03 | ko | academic | 100.0 | 100.0 | 1/1 | pass |
|
|
22
|
+
| adv-mps-ko-04 | ko | product-doc | 100.0 | 100.0 | 1/1 | pass |
|
|
23
|
+
| adv-mps-ko-05 | ko | policy | 100.0 | 100.0 | 1/1 | pass |
|
|
24
|
+
| adv-mps-en-01 | en | marketing | 100.0 | 100.0 | 1/1 | pass |
|
|
25
|
+
| adv-mps-en-02 | en | technical | 100.0 | 100.0 | 1/1 | pass |
|
|
26
|
+
| adv-mps-en-03 | en | academic | 100.0 | 100.0 | 1/1 | pass |
|
|
27
|
+
| adv-mps-en-04 | en | support | 100.0 | 100.0 | 1/1 | pass |
|
|
28
|
+
| adv-mps-en-05 | en | strategy | 100.0 | 100.0 | 1/1 | pass |
|
|
29
|
+
|
|
30
|
+
## Interpretation
|
|
31
|
+
|
|
32
|
+
The audit confirms the known gap: an anchor-preservation floor can pass text that still retains AI-marker density. MPS should remain a meaning-safety floor, not a humanness score. A complementary anti-gaming check should penalize repeated AI-marker recurrence after rewrite, especially when MPS is high.
|
|
33
|
+
|
|
34
|
+
## Proposed MPS-v2 companion check
|
|
35
|
+
|
|
36
|
+
Keep MPS unchanged for semantic safety, then add an independent recurrence gate:
|
|
37
|
+
|
|
38
|
+
1. Score the original and rewritten text with deterministic `analyzeText`.
|
|
39
|
+
2. If `MPS ≥ 90` and rewritten AI score remains `≥ 60`, mark the candidate as `style_not_improved`.
|
|
40
|
+
3. In Ouroboros selection, prefer candidates that pass MPS and lower the AI score; do not let high MPS alone rescue a visibly AI-like rewrite.
|
|
41
|
+
4. Report preserved anchors and recurring AI markers separately so users can decide whether to edit more or keep the register.
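The proposed gate can be sketched in a few lines of JavaScript. This is a hedged illustration of steps 2 and 3, not the in-tree implementation: scores are passed in rather than computed with `analyzeText`, and every name except the `style_not_improved` label is hypothetical.

```javascript
const MPS_FLOOR = 90;        // meaning-safety floor from the gate above
const AI_SCORE_CEILING = 60; // rewritten text at or above this still reads as AI-like

// Step 2: a high MPS does not excuse a rewrite that stays AI-like.
function classifyRewrite({ mps, aiScoreBefore, aiScoreAfter }) {
  if (mps < MPS_FLOOR) return 'meaning_not_preserved'; // hypothetical label
  if (aiScoreAfter >= AI_SCORE_CEILING) return 'style_not_improved';
  return aiScoreAfter < aiScoreBefore ? 'improved' : 'unchanged'; // hypothetical labels
}

// Step 3: among candidates that pass MPS, prefer the lowest AI score;
// MPS only breaks ties, so it cannot rescue an AI-like rewrite.
function pickCandidate(candidates) {
  const passing = candidates.filter((c) => c.mps >= MPS_FLOOR);
  if (passing.length === 0) return null;
  return [...passing].sort(
    (a, b) => a.aiScoreAfter - b.aiScoreAfter || b.mps - a.mps,
  )[0];
}

// The audit fixtures all look like this: perfect anchors, still fully AI-like.
console.log(classifyRewrite({ mps: 100, aiScoreBefore: 100, aiScoreAfter: 100 }));
// -> style_not_improved
```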
|
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
# Quantifying the differences between AI and human writing: patina benchmark research notes
|
|
2
2
|
|
|
3
3
|
Written: 2026-05-20
|
|
4
|
-
Status: research/design note, implementation
|
|
4
|
+
Status: research/design note, partially implemented; reflects the deterministic benchmark and the opt-in live quality scaffold
|
|
5
5
|
|
|
6
6
|
## 1. Question
|
|
7
7
|
|
|
@@ -18,9 +18,9 @@ patina의 벤치마크에서 측정하려는 것은 단순히 “이 글의 출
|
|
|
18
18
|
| Layer | File | Current measurements | Nature |
|
|
19
19
|
|---|---|---|---|
|
|
20
20
|
| Deterministic benchmark | `tests/quality/benchmark.mjs` | accuracy, precision, recall, F1, TP/FP/FN/TN | No LLM calls. Regression test for stylometry/lexicon hot verdicts |
|
|
21
|
-
|
|
|
21
|
+
| opt-in live quality scaffold | `tests/quality/live-quality.mjs` | before/after AI-likeness, deterministic meaning proxy, safe_gain, PASS/WARN/FAIL | Skipped by default, with no model calls. When `OPENCODE_AVAILABLE=1`, checks rewrite quality and the meaning-preservation proxy |
|
|
22
22
|
| scoring spec | `core/scoring.md` | AI-likeness, fidelity, combined score, MPS | Reference document for score definitions |
|
|
23
|
-
| stylometry spec | `core/stylometry.md` | burstiness CV, MATTR, lexicon density | Deterministic signal definitions |
|
|
23
|
+
| stylometry spec | `core/stylometry.md` | burstiness CV, MATTR, lexicon density, koDiagnostics | Deterministic signal definitions |
|
|
24
24
|
|
|
25
25
|
The current deterministic hot verdict is the following OR rule.
|
|
26
26
|
|
|
@@ -28,18 +28,28 @@ patina의 벤치마크에서 측정하려는 것은 단순히 “이 글의 출
|
|
|
28
28
|
paragraph is SUSPECT iff
|
|
29
29
|
burstiness_band == "low"
|
|
30
30
|
OR MATTR_band == "low"
|
|
31
|
-
OR lexicon_density > threshold
|
|
31
|
+
OR (lexicon_density > threshold AND lexicon_min_hits is satisfied)
|
|
32
|
+
OR koDiagnostics.hot == true
|
|
32
33
|
```
|
|
33
34
|
|
|
34
|
-
|
|
35
|
+
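`burstiness_band` is assigned only when a paragraph has three or more sentences. For two or fewer sentences, the CV
|
|
36
|
+
is kept only as a diagnostic value and does not produce a hot verdict on its own.
|
|
37
|
+
For the Korean/Chinese/Japanese lexicons, a single hit is not treated as hot and is kept only as an audit hint.
|
|
38
|
+
This is a conservative guard against false positives where one common noun matches in a short professional or policy paragraph.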
|
|
39
|
+
|
|
40
|
+
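Korean (`lang=ko`) additionally records the `spacing`, `comma`, and `posDiversity` diagnostics.
|
|
41
|
+
`posDiversity` is not a morphological analyzer but a suffix-class proxy over particles and endings. Only when all three metrics
|
|
42
|
+
cross their conservative thresholds does `koDiagnostics.hot` enter the OR rule.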
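A hedged JavaScript sketch of the OR rule and its guards. The feature object shape, the function name, and the threshold values in the usage lines are assumptions for illustration; the real values live in `core/stylometry.md` and the scoring code.

```javascript
// Hypothetical feature shape; the repository's real structure may differ.
function isSuspect(features, { lexiconThreshold, lexiconMinHits }) {
  // Burstiness counts only when the paragraph has three or more sentences.
  const burstinessLow =
    features.sentenceCount >= 3 && features.burstinessBand === 'low';
  const mattrLow = features.mattrBand === 'low';
  // Density alone is not enough: the minimum-hit guard must also hold.
  const lexiconHot =
    features.lexiconDensity > lexiconThreshold &&
    features.lexiconHits >= lexiconMinHits;
  // Korean composite: already requires all three diagnostics upstream.
  const koHot = Boolean(features.koDiagnostics && features.koDiagnostics.hot);
  return burstinessLow || mattrLow || lexiconHot || koHot;
}

// Illustrative thresholds only, not the shipped defaults.
const limits = { lexiconThreshold: 10, lexiconMinHits: 2 };

// A short Korean policy paragraph with one lexicon hit stays cold.
console.log(isSuspect(
  { sentenceCount: 2, burstinessBand: 'low', mattrBand: 'ok',
    lexiconDensity: 25, lexiconHits: 1, koDiagnostics: { hot: false } },
  limits,
)); // false
```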
|
|
43
|
+
|
|
44
|
+
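The current public benchmark snapshot shows an overall accuracy of 1.0 across 39 ko/en/zh/ja fixtures. However, this corpus consists of small, designed synthetic/curated fixtures, so it should not be over-trusted as evidence of generalization.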
|
|
35
45
|
|
|
36
46
|
### 2.1 Short-term benchmark limits and rebaseline plan (#155/#162)
|
|
37
47
|
|
|
38
48
|
The currently checked-in deterministic report is useful as a regression test, but it is still too narrow to serve as a public performance claim.
|
|
39
49
|
|
|
40
|
-
- Sample size: ko/en/zh/ja suspect-zone fixture
|
|
50
|
+
- Sample size: 39 ko/en/zh/ja suspect-zone fixtures. Not large enough to produce confidence intervals per language, genre, or source.
|
|
41
51
|
- Model era: the rebaseline on 2025+ GPT/Claude/Gemini/Llama/Qwen-family generations is still a separate follow-up.
|
|
42
|
-
- Statistical reporting: `docs/benchmarks/latest.md`: fixture count, lang/class sample size, Wilson 95% CI
|
|
52
|
+
- Statistical reporting: `docs/benchmarks/latest.md` publishes fixture counts, lang/class sample sizes, Wilson 95% CIs, and `signal_score`-based ROC-AUC / PR-AUC / best-F1 threshold diagnostics. There are no bootstrap intervals yet.
|
|
43
53
|
- Scope: the current numbers are a regression on stylometry/lexicon hot verdicts, not live quality scores for rewrite quality, MPS, or fidelity.
|
|
44
54
|
|
|
45
55
|
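README/benchmark wording should therefore be read as “the regression passes on the current fixtures,” not as “it detects 100% on new models and new genres.” The next rebaseline will at minimum fix the sample size per language × class × register and publish confidence intervals together with a threshold sweep.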
|
|
@@ -66,7 +76,7 @@ MATTR는 길이에 민감한 TTR 문제를 완화하기 위해 moving window를
|
|
|
66
76
|
|
|
67
77
|
- Basis: Covington & McFall's MATTR paper points out the length dependence of plain TTR and proposes a moving-average approach.
|
|
68
78
|
- patina observation: an auxiliary signal in English; in Korean, eojeol (space-delimited word unit) tokenization inflates MATTR, so it is a weak signal.
|
|
69
|
-
- patina verdict: **Keep as is. In Korean,
|
|
79
|
+
- patina verdict: **Keep as is. Not treated as a primary signal in Korean.**
|
|
70
80
|
|
|
71
81
|
### 3.3 AI-favored lexicon density
|
|
72
82
|
|
|
@@ -80,7 +90,7 @@ density = matched_ai_lexicon_entries / token_count * 1000
|
|
|
80
90
|
- Weakness: can confuse legitimate domain terminology with AI packaging words.
|
|
81
91
|
- patina verdict: **Per-language/per-domain calibration is needed.** For the Korean lexicon in particular, paired corpus mining was effective.
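A small JavaScript sketch of the density formula for this section (`density = matched_ai_lexicon_entries / token_count * 1000`). The function name is hypothetical; the example shows how a single match in a short paragraph already yields a high per-1000 value, which is the kind of case the weakness above describes.

```javascript
// Matches per 1000 tokens; returns 0 for empty input instead of dividing by zero.
function lexiconDensity(matchedEntries, tokenCount) {
  if (tokenCount <= 0) return 0;
  return (matchedEntries / tokenCount) * 1000;
}

// One ordinary domain term in a 40-token paragraph gives about 25 per 1000,
// while the same single match in 400 tokens gives about 2.5.
console.log(lexiconDensity(1, 40));
console.log(lexiconDensity(1, 400));
```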
|
|
82
92
|
|
|
83
|
-
### 3.4 Pattern
|
|
93
|
+
### 3.4 Pattern severity score
|
|
84
94
|
|
|
85
95
|
patina's most important differentiator lies in “AI writing pattern editing” rather than “AI detection.” The following patterns are therefore both detector features and rewrite targets.
|
|
86
96
|
|
|
@@ -91,13 +101,13 @@ patina의 가장 중요한 차별점은 “AI detection”보다 “AI writing p
|
|
|
91
101
|
- Authority claims without sources
|
|
92
102
|
- viral-hook score-only rhetoric
|
|
93
103
|
|
|
94
|
-
patina verdict: **Pattern
|
|
104
|
+
patina verdict: **Pattern scores should remain the central axis.** Statistical metrics serve as auxiliary signals to the pattern scan.
|
|
95
105
|
|
|
96
|
-
### 3.5 Language-model probability
|
|
106
|
+
### 3.5 Language-model probability metrics
|
|
97
107
|
|
|
98
108
|
Representative research:
|
|
99
109
|
|
|
100
|
-
| Approach |
|
|
110
|
+
| Approach | Key idea | Applicability to patina |
|
|
101
111
|
|---|---|---|
|
|
102
112
|
| [GLTR](https://arxiv.org/abs/1906.04043) | Visualizes “too-predictable choices” via token probability, rank, and entropy | Explainable, but requires a local LM or API logprobs |
|
|
103
113
|
| [DetectGPT](https://arxiv.org/abs/2301.11305) | Hypothesis that generated text lies in a particular region of log-probability curvature | Powerful, but perturbation and LM logprob costs are high |
|
|
@@ -118,13 +128,15 @@ patina 판단: **기본 benchmark에는 넣지 말고 optional research track으
|
|
|
118
128
|
- sentence opener diversity
|
|
119
129
|
- POS-like rough proxy: nominal sentence endings, passive expressions, nominalization suffixes
|
|
120
130
|
|
|
121
|
-
patina verdict:
|
|
131
|
+
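patina verdict: **Started with a conservative composite, Korean first.** It can be implemented without an external model and,
|
|
132
|
+
being an axis distinct from pattern/lexicon, it is promising, but a single feature such as the absence of commas does not feed the hot verdict.
|
|
133
|
+
`koDiagnostics.hot` turns on only when the spacing/comma/suffix proxies all agree.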
|
|
122
134
|
|
|
123
135
|
## 4. Cautions from detection research
|
|
124
136
|
|
|
125
137
|
### 4.1 Detectors are generally weak on OOD inputs
|
|
126
138
|
|
|
127
|
-
[M4](https://arxiv.org/abs/2305.14902): multi-generator, multi-domain, multilingual
|
|
139
|
+
[M4](https://arxiv.org/abs/2305.14902) reports that detector generalization is difficult in multi-generator, multi-domain, multilingual settings. [RAID](https://arxiv.org/abs/2405.07940) reports that existing detectors are easily fooled by changes in sampling strategy, adversarial attacks, and unseen models.
|
|
128
140
|
|
|
129
141
|
The patina benchmark has the same pitfall. Even if the synthetic fixtures score 100%, it can break in the following situations.
|
|
130
142
|
|
|
@@ -136,7 +148,7 @@ patina 벤치마크도 같은 함정이 있다. synthetic fixture에서 100%가
|
|
|
136
148
|
|
|
137
149
|
### 4.2 Do not speak in terms of a single “AI probability”
|
|
138
150
|
|
|
139
|
-
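OpenAI also discontinued its earlier AI classifier because of low accuracy and stated that it is unreliable on short text, non-English text, OOD inputs, and edited AI text. The official document's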
|
|
151
|
+
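OpenAI also discontinued its earlier AI classifier because of low accuracy and stated that it is unreliable on short text, non-English text, OOD inputs, and edited AI text. The main message of the official document is “do not use it as a primary decision-making tool.”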
|
|
140
152
|
|
|
141
153
|
patina should likewise express results as follows rather than as an `AI-generated probability`.
|
|
142
154
|
|
|
@@ -150,7 +162,7 @@ patina도 `AI-generated probability`가 아니라 다음처럼 표현해야 한
|
|
|
150
162
|
- watermark: detects a signal that a specific generation system embedded in advance
|
|
151
163
|
- patina: reduces stylistic AI signals in the output text
|
|
152
164
|
|
|
153
|
-
Watermarking therefore stays as reference research only, and the patina benchmark's
|
|
165
|
+
Watermarking therefore stays as reference research only and is not added as a primary metric of the patina benchmark.
|
|
154
166
|
|
|
155
167
|
## 5. Metric design proposal for patina
|
|
156
168
|
|
|
@@ -196,10 +208,10 @@ deterministic_ai_likeness =
|
|
|
196
208
|
|
|
197
209
|
| Feature | Initial weight | Rationale |
|
|
198
210
|
|---|---:|---|
|
|
199
|
-
| pattern severity | 0.35 | patina's
|
|
211
|
+
| pattern severity | 0.35 | Directly tied to patina's main purpose |
|
|
200
212
|
| lexicon density | 0.20 | Explainable, with prior calibration experience |
|
|
201
|
-
| burstiness CV | 0.20 | On the current ko/en/zh/ja suspect-zone fixtures, an effective
|
|
202
|
-
| function-word divergence | 0.15 |
|
|
213
|
+
| burstiness CV | 0.20 | A key deterministic signal that holds on the current ko/en/zh/ja suspect-zone fixtures |
|
|
214
|
+
| function-word divergence | 0.15 | Starts as a conservative composite, beginning with the Korean suffix proxy. Highly topic-independent |
|
|
203
215
|
| punctuation/opening rhythm | 0.05 | Lightweight auxiliary signal |
|
|
204
216
|
| MATTR | 0.05 | Weak for Korean, so it starts low |
|
|
205
217
|
|
|
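The weight table above can be read as a weighted sum. The full `deterministic_ai_likeness` formula is not visible in this hunk, so the sketch below makes two assumptions: each feature is already normalized to the range 0 to 1, and the result is reported on a 0 to 100 scale. The dictionary keys are illustrative names, not identifiers from the patina codebase.

```python
# Initial weights copied from the table; keys are illustrative names.
WEIGHTS = {
    "pattern_severity": 0.35,
    "lexicon_density": 0.20,
    "burstiness_cv": 0.20,
    "function_word_divergence": 0.15,
    "punctuation_opening_rhythm": 0.05,
    "mattr": 0.05,
}


def deterministic_ai_likeness(features: dict[str, float]) -> float:
    """Weighted sum of pre-normalized features, scaled to 0-100 (assumed scale)."""
    missing = WEIGHTS.keys() - features.keys()
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = features[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1], got {value}")
        total += weight * value
    return 100.0 * total
```

Because the six weights sum to 1.0, a text that maxes out every feature scores 100 and one that triggers none scores 0.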
@@ -237,7 +249,7 @@ safe_gain = max(0, humanization_gain) * (meaning_safety / 100)
|
|
|
237
249
|
|
|
238
250
|
#### Stability
|
|
239
251
|
|
|
240
|
-
LLM
|
|
252
|
+
LLM rewrites are non-deterministic, so the variance across N runs of the same fixture is also needed.
|
|
241
253
|
|
|
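As a rough sketch of the two quantities in this part of the document, the snippet below computes `safe_gain` exactly as written in the hunk header and a run-to-run spread for the stability metric. The document only says "stddev", so using the sample standard deviation (`statistics.stdev`) is an assumption; the population form (`statistics.pstdev`) would be an equally plausible reading.

```python
import statistics


def safe_gain(humanization_gain: float, meaning_safety: float) -> float:
    """safe_gain = max(0, humanization_gain) * (meaning_safety / 100)."""
    return max(0.0, humanization_gain) * (meaning_safety / 100.0)


def score_stability(after_ai_likeness: list[float]) -> float:
    """Spread of the post-rewrite score over N runs of the same fixture.

    Assumption: sample standard deviation, which needs at least two runs.
    """
    if len(after_ai_likeness) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.stdev(after_ai_likeness)
```

A lower value means the rewrite lands on a similar score each time; a negative gain is clamped to zero so a worse rewrite never counts as progress.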
242
254
|
```text
|
|
243
255
|
score_stability = stddev(after_ai_likeness over N runs)
|
|
@@ -306,13 +318,13 @@ notes: |
|
|
|
306
318
|
### Phase 1 — Stronger reporting, no dependencies
|
|
307
319
|
|
|
308
320
|
- store the feature vector in more detail in `tests/quality/results.json`
|
|
309
|
-
- threshold sweep and ROC-AUC/PR-AUC
|
|
321
|
+
- recompute the threshold sweep and ROC-AUC/PR-AUC on a larger 2025+ corpus
|
|
310
322
|
- add per-register/per-language/per-class summaries
|
|
311
323
|
- state in `tests/quality/README.md` that this is “AI-likeness, not provenance”
|
|
312
324
|
|
|
313
325
|
### Phase 2 — Corpus expansion
|
|
314
326
|
|
|
315
|
-
- synthetic/curated fixture
|
|
327
|
+
- expand from the 39 synthetic/curated fixtures to real-world sampled fixtures
|
|
316
328
|
- grow the human false-positive registers first
|
|
317
329
|
- target: at least 100 human + 100 AI paragraphs per language
|
|
318
330
|
|
|
@@ -357,7 +369,7 @@ patina가 따라야 할 방향은 범용 AI detector가 아니다. 더 강한
|
|
|
357
369
|
4. **Do not claim a single AI probability.**
|
|
358
370
|
5. Isolate heavy LM-probability detectors in an optional research track.
|
|
359
371
|
|
|
360
|
-
In short, the patina benchmark's
|
|
372
|
+
In short, the patina benchmark's primary metrics can be summarized in the following three sentences.
|
|
361
373
|
|
|
362
374
|
```text
|
|
363
375
|
How AI-like does it read?
|
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
# Blinded human evaluation panel plan
|
|
2
|
+
|
|
3
|
+
Status: study design ready; panel not run.
|
|
4
|
+
Related issue: #159.
|
|
5
|
+
|
|
6
|
+
This plan keeps human preference work separate from deterministic scoring. It should run only with texts that can be shown to reviewers and redistributed or summarized under consent.
|
|
7
|
+
|
|
8
|
+
## Research question
|
|
9
|
+
|
|
10
|
+
Can blinded readers tell the AI-like draft from the Patina rewrite, and do they prefer the rewrite without reporting any loss of meaning?
|
|
11
|
+
|
|
12
|
+
## Minimum pilot
|
|
13
|
+
|
|
14
|
+
- 30 paired samples;
|
|
15
|
+
- 5 raters per sample;
|
|
16
|
+
- language and register recorded for each pair;
|
|
17
|
+
- randomized A/B order;
|
|
18
|
+
- no model or tool names shown to raters;
|
|
19
|
+
- reviewer consent and redistribution notes stored outside public fixtures unless publishable.
|
|
20
|
+
|
|
21
|
+
## Rater task
|
|
22
|
+
|
|
23
|
+
For each pair, ask:
|
|
24
|
+
|
|
25
|
+
1. Which version reads more natural for the stated context?
|
|
26
|
+
2. Did either version lose a key fact, number, name, or caveat?
|
|
27
|
+
3. Which version would you send with light edits?
|
|
28
|
+
4. Free-text note, optional.
|
|
29
|
+
|
|
30
|
+
## Report shape
|
|
31
|
+
|
|
32
|
+
| metric | output |
|
|
33
|
+
|---|---|
|
|
34
|
+
| naturalness preference | Patina / original / tie counts with confidence interval |
|
|
35
|
+
| meaning concern | rate of reported fact/caveat loss |
|
|
36
|
+
| register split | results by language and register |
|
|
37
|
+
| rater agreement | kappa or raw agreement, depending on labels |
|
|
38
|
+
| exclusions | samples removed and why |
|
|
39
|
+
|
|
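For the "counts with confidence interval" row, one common choice is the Wilson score interval for a binomial proportion. The plan does not name an interval method, so this is a sketch under that assumption; `wilson_interval` is a hypothetical helper, not part of Patina.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)
```

Ratings of the same pair by five raters are not independent, so an interval computed over all 150 ratings would look tighter than it should. Computing it over the 30 pairs, for example on the majority preference per pair, is the more cautious reading.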
40
|
+
## Privacy rule
|
|
41
|
+
|
|
42
|
+
Do not commit reviewer names, private comments, or no-redistribution source text. Public reports can include aggregate counts and short examples only when the source license and reviewer consent allow it.
|
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
# Cross-judge agreement plan
|
|
2
|
+
|
|
3
|
+
Status: CLI shortcut removed during surface simplification; full matrix blocked on evaluator budget.
|
|
4
|
+
Related issue: #158.
|
|
5
|
+
|
|
6
|
+
A score is more useful when the judge is not from the same model family as the suspected generator. Patina no longer exposes a per-run CLI warning for this; cross-family independence belongs in an explicit evaluation matrix rather than everyday score UX.
|
|
7
|
+
|
|
8
|
+
## Full matrix gate
|
|
9
|
+
|
|
10
|
+
The full issue is still open until a report covers:
|
|
11
|
+
|
|
12
|
+
- 3 generator families × 3 judge families × 30 samples;
|
|
13
|
+
- shared prompts and fixed sample ids;
|
|
14
|
+
- pairwise agreement table;
|
|
15
|
+
- Krippendorff alpha or Cohen/Fleiss kappa where the labels support it;
|
|
16
|
+
- a note when a judge is evaluating its own family.
|
|
17
|
+
|
|
18
|
+
## Matrix template
|
|
19
|
+
|
|
20
|
+
| sample set | generator | judge | n | hot agree | hot disagree | agreement |
|
|
21
|
+
|---|---|---|---:|---:|---:|---:|
|
|
22
|
+
| pending | pending | pending | 0 | 0 | 0 | n/a |
|
|
23
|
+
|
|
24
|
+
Do not fill this table with synthetic numbers. Use it only after the manifest and judge outputs exist.
|
|
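The `agreement` column and the kappa requirement can be sketched for one pair of judges as follows. This is a minimal illustration with hypothetical function names, taking each judge's `hot` verdicts as booleans over the same fixed sample ids; it is meant for real judge outputs only, in line with the rule above.

```python
import math


def raw_agreement(a: list[bool], b: list[bool]) -> float:
    """Share of samples where both judges gave the same hot verdict."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two judges with binary labels.

    Returns NaN when chance agreement is 1 (both judges gave the same
    constant label), because kappa is undefined in that case.
    """
    observed = raw_agreement(a, b)
    pa = sum(a) / len(a)  # judge A's hot rate
    pb = sum(b) / len(b)  # judge B's hot rate
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return math.nan
    return (observed - expected) / (1 - expected)
```

Raw agreement is still worth reporting next to kappa, because kappa drops sharply when one label dominates even if the judges rarely disagree.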
@@ -0,0 +1,135 @@
|
|
|
1
|
+
# KO/2025+ Corpus Source Inventory
|
|
2
|
+
|
|
3
|
+
Verified: 2026-05-22. This inventory turns the Korean rebaseline blocker
|
|
4
|
+
(#303/#157/#155/#160) into an executable intake plan. It is not a public
|
|
5
|
+
performance claim.
|
|
6
|
+
|
|
7
|
+
## Decision
|
|
8
|
+
|
|
9
|
+
Use a **metadata-first corpus**:
|
|
10
|
+
|
|
11
|
+
1. keep raw text in `artifacts/rebaseline-2025/private/` or another private store;
|
|
12
|
+
2. commit only redistributable examples, hashes, metadata, and aggregate reports;
|
|
13
|
+
3. do not publish catch-rate claims until `docs/research/2025-rebaseline-plan.md`
|
|
14
|
+
reaches its n≥100 public claim gate.
|
|
15
|
+
|
|
16
|
+
That lets Patina use current Korean sources without copying restricted corpora
|
|
17
|
+
into the repository.
|
|
18
|
+
|
|
19
|
+
## Source matrix
|
|
20
|
+
|
|
21
|
+
| source | role | evidence | repo policy | first intake |
|
|
22
|
+
|---|---|---|---|---|
|
|
23
|
+
| KatFish / KatFishNet | Korean AI-vs-human seed set; especially useful for #303 punctuation/spacing checks | ACL 2025/arXiv paper says KatFish covers human and four-LLM Korean text across three genres; the public GitHub repo contains `katfish_dataset/*.jsonl` but no repo-level license metadata is exposed | Treat raw rows as **license-review** until a license/redistribution decision is recorded. Aggregate metrics are OK; raw rows stay private. | `docs/benchmarks/katfish-ko-latest.md` reports aggregate-only calibration: Patina KO diagnostics improve KatFish catch rate by +15.9 pp with +0 public-web human-control FP rows. |
|
|
24
|
+
| 모두의 말뭉치 | human Korean controls for product-doc, news/editorial, dialogue, summary registers | NIKL lists 2024/2025 corpora and requires corpus application/approval fields | Do not commit raw text. Store local extracts privately and commit only hashes/metrics unless approval explicitly allows publication. | Apply for research/evaluation use; start with 25 human-control paragraphs after approval. |
|
|
25
|
+
| 한국어 학습자 말뭉치 | false-positive stress test for learner/second-language Korean, not a normal human baseline | Official site describes learner writing/speech corpora and notes original learner material is privacy-protected; Data.go.kr lists research-purpose distribution limits | Use as a separate FP envelope. Do not blend with native Korean controls. Commit metadata/hashes only. | Add `class: natural-human`, `register: learner-writing` only after schema/register decision, or map to `academic-summary` with reviewer note. |
|
|
26
|
+
| HAERAE-HUB/KOREAN-SyntheticText-1.5B | broad synthetic Korean AI-like pool | Hugging Face dataset page shows text parquet with 1.55M rows | Synthetic side only. Check dataset card/license before committing full text; otherwise hash-only. | Sample short paragraphs for lexicon mining candidates, then manually review before pattern changes. |
|
|
27
|
+
| Maintainer-generated 2026 prompts | controlled GPT/Claude/Gemini/open-weight model-era rows | Generated from repo-owned prompts and reproducible metadata | Preferred public seed when provider terms and prompt contents allow redistribution. Keep prompts public; keep vendor UI copies private if unsure. | Generate 5 rows each for GPT-family, Claude-family, Gemini-family, and open-weight across blog/product-doc/chat-update. |
|
|
28
|
+
| Community false-positive submissions | real Patina FP cases | GitHub false-positive issue template captures language/register/score output | Use only with explicit fixture permission. Strip account/private context by default. | Convert accepted issues into hash-only rows first; promote to fixture only after permission. |
|
|
29
|
+
| Public Korean web pages (Korea.kr, Toss Tech, Kakao/Naver/Toss docs, KISTEP/KEI/NRF, Seoul OpenGov) | natural-human Korean pilot controls for source/provenance workflow | Public pages with visible source URLs and licensing/copyright guidance where available | Commit only hash-only metadata until page-level redistribution and attribution are reviewed. Keep raw extracts in ignored `artifacts/rebaseline-2025/private/`. | `artifacts/rebaseline-2025/human-controls.public.jsonl` contains 250 scored hash-only candidate rows, with 50 rows each for blog, product-doc, academic-summary, technical-how-to, and chat-update; positive AI-like rows still block threshold work. |
|
|
30
|
+
|
|
31
|
+
## Intake commands
|
|
32
|
+
|
|
33
|
+
Use the local workspace scaffold:
|
|
34
|
+
|
|
35
|
+
```bash
|
|
36
|
+
npm run benchmark:rebaseline:intake -- \
|
|
37
|
+
--input artifacts/rebaseline-2025/intake.local.jsonl \
|
|
38
|
+
--public-output artifacts/rebaseline-2025/manifest.public.jsonl \
|
|
39
|
+
--private-output artifacts/rebaseline-2025/private/generations.private.jsonl \
|
|
40
|
+
--require-source-review
|
|
41
|
+
|
|
42
|
+
node scripts/rebaseline-summary.mjs \
|
|
43
|
+
--input artifacts/rebaseline-2025/manifest.public.jsonl \
|
|
44
|
+
--json
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
The intake helper computes missing `text_hash` values. If redistribution is not
|
|
48
|
+
public, it strips `text` from the public manifest and writes the full row to the
|
|
49
|
+
private output path.
|
|
50
|
+
|
|
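The hash-and-strip behavior described above can be sketched as below. The field names `text` and `text_hash` come from this document; the `redistribution` field with a `"public"` value, the SHA-256 hex digest, and the `split_row` name are assumptions for illustration. The real helper's schema may differ.

```python
import hashlib
from typing import Optional


def split_row(row: dict) -> tuple[dict, Optional[dict]]:
    """Return (public_row, private_row) for one intake row.

    - Fills in `text_hash` when it is missing and `text` is present.
    - If redistribution is not public, removes `text` from the public row
      and returns the full row for the private output.
    """
    full = dict(row)
    if not full.get("text_hash") and "text" in full:
        # Assumption: SHA-256 over the UTF-8 bytes of the text.
        full["text_hash"] = hashlib.sha256(full["text"].encode("utf-8")).hexdigest()
    if full.get("redistribution") == "public":
        return full, None
    public = {key: value for key, value in full.items() if key != "text"}
    return public, full
```

Keeping the hash in both outputs lets a public manifest row be matched to its private source without exposing the text.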
51
|
+
Tracked starter files:
|
|
52
|
+
|
|
53
|
+
- `artifacts/rebaseline-2025/prompts.template.jsonl` — repo-owned prompt
|
|
54
|
+
anchors for Korean academic/종결-다, product-doc, blog, chat/update,
|
|
55
|
+
technical-how-to, and edited-AI rows.
|
|
56
|
+
- `artifacts/rebaseline-2025/intake.local.example.jsonl` — 25 metadata-only
|
|
57
|
+
rows matching the pilot buckets below. The hashes are placeholders; replace
|
|
58
|
+
them locally before treating the file as evidence.
|
|
59
|
+
- `artifacts/rebaseline-2025/sources.ko-public.jsonl` — tracked public-source
|
|
60
|
+
inventory for hash-only web collection.
|
|
61
|
+
- `artifacts/rebaseline-2025/human-controls.public.jsonl` — 250 web-sourced
|
|
62
|
+
Korean natural-human candidate rows, balanced at 50 rows for each tracked
|
|
63
|
+
register. It is hash-only and validates the collection path; absent AI-like
|
|
64
|
+
cells still block public claim changes.
|
|
65
|
+
|
|
66
|
+
To refresh public-web candidates from the tracked source inventory:
|
|
67
|
+
|
|
68
|
+
```bash
|
|
69
|
+
npm run benchmark:rebaseline:web -- \
|
|
70
|
+
--input artifacts/rebaseline-2025/sources.ko-public.jsonl \
|
|
71
|
+
--output artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
|
|
72
|
+
--target-per-register 50 \
|
|
73
|
+
--max-per-source 12 \
|
|
74
|
+
--collected-at 2026-05-22
|
|
75
|
+
|
|
76
|
+
npm run benchmark:rebaseline:score -- \
|
|
77
|
+
--input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
|
|
78
|
+
--output artifacts/rebaseline-2025/human-controls.public.jsonl \
|
|
79
|
+
--scored-at 2026-05-22
|
|
80
|
+
|
|
81
|
+
node scripts/rebaseline-summary.mjs \
|
|
82
|
+
--input artifacts/rebaseline-2025/human-controls.public.jsonl \
|
|
83
|
+
--json
|
|
84
|
+
|
|
85
|
+
npm run benchmark:katfish-ko -- --write --basename katfish-ko-latest
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
To score those candidates without committing raw text:
|
|
89
|
+
|
|
90
|
+
```bash
|
|
91
|
+
npm run benchmark:rebaseline:score -- \
|
|
92
|
+
--input artifacts/rebaseline-2025/private/web-human-controls.generated.private.jsonl \
|
|
93
|
+
--output artifacts/rebaseline-2025/human-controls.public.jsonl \
|
|
94
|
+
--scored-at 2026-05-22
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
## Remaining Korean pilot holes
|
|
98
|
+
|
|
99
|
+
The 250-row public-web pilot proves the collection and scoring path and fills the native-human register coverage gate for #157. Use the original 25-row skeleton only as a local intake template; future rows should fill the positive and comparison holes instead:
|
|
100
|
+
|
|
101
|
+
| bucket | remaining need | notes |
|
|
102
|
+
|---|---:|---|
|
|
103
|
+
| native human controls | 0 for the five tracked registers | Current tracked split is n=50 each for academic-summary, product-doc, chat-update, blog, and technical-how-to. |
|
|
104
|
+
| self-generated AI-like | n≥100 per GPT/Claude/Gemini/open-weight claim cell | Keep prompt ids, model ids, decoding, and provider terms notes; KatFish aggregate closes only the KO diagnostic calibration, not public claim coverage. |
|
|
105
|
+
| lightly/heavily edited AI | at least one light and one heavy edit per target register | Preserve before/after hashes and edit policy. |
|
|
106
|
+
| KatFish aggregate comparison | complete for #303 | Raw rows stay private; tracked output is aggregate-only. |
|
|
107
|
+
| FP submissions / learner stress | separate reviewed envelope | Do not blend learner Korean into the native-human baseline. |
|
|
108
|
+
|
|
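The n≥100 claim-cell requirement in the table above can be checked mechanically. The sketch below assumes each manifest row carries a `class` field (the value `natural-human` appears in this document; `ai-like` is an assumed counterpart) and a hypothetical `generator_family` field. Neither name is confirmed by the manifest schema.

```python
from collections import Counter

# Families named in the table; the string values are illustrative.
FAMILIES = ("gpt", "claude", "gemini", "open-weight")
CLAIM_GATE = 100


def cells_below_gate(rows: list[dict], gate: int = CLAIM_GATE) -> dict[str, int]:
    """Return {family: current count} for every family still under the gate."""
    counts = Counter(
        row.get("generator_family")
        for row in rows
        if row.get("class") == "ai-like"
    )
    return {
        family: counts.get(family, 0)
        for family in FAMILIES
        if counts.get(family, 0) < gate
    }
```

An empty result would mean every family has reached the gate; any remaining entry shows how many rows that cell currently holds.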
109
|
+
Exit criteria for the pilot:
|
|
110
|
+
|
|
111
|
+
- `npm run benchmark:rebaseline:intake -- --input artifacts/rebaseline-2025/intake.local.jsonl --dry-run --require-source-review`
|
|
112
|
+
passes before any rows are scored;
|
|
113
|
+
- `node scripts/rebaseline-summary.mjs --input <manifest>` validates with no errors;
|
|
114
|
+
- every row has `source_review` or `reviewer_notes` explaining redistribution
|
|
115
|
+
status when raw text is absent;
|
|
116
|
+
- no threshold or README catch-rate claim changes are made from the pilot alone;
|
|
117
|
+
- findings are posted back to #303/#157/#155 before #160 lexicon mining starts.
|
|
118
|
+
|
|
119
|
+
## Source links
|
|
120
|
+
|
|
121
|
+
- KatFishNet paper: <https://arxiv.org/abs/2503.00032>
|
|
122
|
+
- KatFishNet repository: <https://github.com/Shinwoo-Park/katfishnet>
|
|
123
|
+
- 모두의 말뭉치 request page: <https://kli.korean.go.kr/corpus/main/requestMain.do>
|
|
124
|
+
- 모두의 말뭉치 introduction: <https://kli.korean.go.kr/m/introduce/corpusIntroduce.do>
|
|
125
|
+
- 한국어 학습자 말뭉치: <https://kcorpus.korean.go.kr/index/goIntroduceSite.do>
|
|
126
|
+
- 한국어 학습자 말뭉치 Data.go.kr metadata: <https://www.data.go.kr/data/15094033/fileData.do>
|
|
127
|
+
- HAERAE Korean SyntheticText: <https://huggingface.co/datasets/HAERAE-HUB/KOREAN-SyntheticText-1.5B>
|
|
128
|
+
- KOGL introduction: <https://www.kogl.or.kr/info/introduce.do>
|
|
129
|
+
- KOGL license guide: <https://www.kogl.or.kr/info/license.do>
|
|
130
|
+
- Korea.kr policy article: <https://www.korea.kr/news/policyNewsView.do?newsId=148959377>
|
|
131
|
+
- MCST KOGL type guide: <https://www.mcst.go.kr/site/s_open/kogl/koglType.jsp>
|
|
132
|
+
- MCST copyright Q&A: <https://www.mcst.go.kr/site/s_policy/copyright/question/question17.jsp>
|
|
133
|
+
- Seoul OpenGov copyright policy: <https://opengov.seoul.go.kr/copyright>
|
|
134
|
+
- Tracked public-web source inventory:
|
|
135
|
+
`artifacts/rebaseline-2025/sources.ko-public.jsonl`
|