@llm-translate/cli 1.0.0-next.1
- package/.dockerignore +51 -0
- package/.env.example +33 -0
- package/.github/workflows/docs-pages.yml +57 -0
- package/.github/workflows/release.yml +49 -0
- package/.translaterc.json +44 -0
- package/CLAUDE.md +243 -0
- package/Dockerfile +55 -0
- package/README.md +371 -0
- package/RFC.md +1595 -0
- package/dist/cli/index.d.ts +2 -0
- package/dist/cli/index.js +4494 -0
- package/dist/cli/index.js.map +1 -0
- package/dist/index.d.ts +1152 -0
- package/dist/index.js +3841 -0
- package/dist/index.js.map +1 -0
- package/docker-compose.yml +56 -0
- package/docs/.vitepress/config.ts +161 -0
- package/docs/api/agent.md +262 -0
- package/docs/api/engine.md +274 -0
- package/docs/api/index.md +171 -0
- package/docs/api/providers.md +304 -0
- package/docs/changelog.md +64 -0
- package/docs/cli/dir.md +243 -0
- package/docs/cli/file.md +213 -0
- package/docs/cli/glossary.md +273 -0
- package/docs/cli/index.md +129 -0
- package/docs/cli/init.md +158 -0
- package/docs/cli/serve.md +211 -0
- package/docs/glossary.json +235 -0
- package/docs/guide/chunking.md +272 -0
- package/docs/guide/configuration.md +139 -0
- package/docs/guide/cost-optimization.md +237 -0
- package/docs/guide/docker.md +371 -0
- package/docs/guide/getting-started.md +150 -0
- package/docs/guide/glossary.md +241 -0
- package/docs/guide/index.md +86 -0
- package/docs/guide/ollama.md +515 -0
- package/docs/guide/prompt-caching.md +221 -0
- package/docs/guide/providers.md +232 -0
- package/docs/guide/quality-control.md +206 -0
- package/docs/guide/vitepress-integration.md +265 -0
- package/docs/index.md +63 -0
- package/docs/ja/api/agent.md +262 -0
- package/docs/ja/api/engine.md +274 -0
- package/docs/ja/api/index.md +171 -0
- package/docs/ja/api/providers.md +304 -0
- package/docs/ja/changelog.md +64 -0
- package/docs/ja/cli/dir.md +243 -0
- package/docs/ja/cli/file.md +213 -0
- package/docs/ja/cli/glossary.md +273 -0
- package/docs/ja/cli/index.md +111 -0
- package/docs/ja/cli/init.md +158 -0
- package/docs/ja/guide/chunking.md +271 -0
- package/docs/ja/guide/configuration.md +139 -0
- package/docs/ja/guide/cost-optimization.md +30 -0
- package/docs/ja/guide/getting-started.md +150 -0
- package/docs/ja/guide/glossary.md +214 -0
- package/docs/ja/guide/index.md +32 -0
- package/docs/ja/guide/ollama.md +410 -0
- package/docs/ja/guide/prompt-caching.md +221 -0
- package/docs/ja/guide/providers.md +232 -0
- package/docs/ja/guide/quality-control.md +137 -0
- package/docs/ja/guide/vitepress-integration.md +265 -0
- package/docs/ja/index.md +58 -0
- package/docs/ko/api/agent.md +262 -0
- package/docs/ko/api/engine.md +274 -0
- package/docs/ko/api/index.md +171 -0
- package/docs/ko/api/providers.md +304 -0
- package/docs/ko/changelog.md +64 -0
- package/docs/ko/cli/dir.md +243 -0
- package/docs/ko/cli/file.md +213 -0
- package/docs/ko/cli/glossary.md +273 -0
- package/docs/ko/cli/index.md +111 -0
- package/docs/ko/cli/init.md +158 -0
- package/docs/ko/guide/chunking.md +271 -0
- package/docs/ko/guide/configuration.md +139 -0
- package/docs/ko/guide/cost-optimization.md +30 -0
- package/docs/ko/guide/getting-started.md +150 -0
- package/docs/ko/guide/glossary.md +214 -0
- package/docs/ko/guide/index.md +32 -0
- package/docs/ko/guide/ollama.md +410 -0
- package/docs/ko/guide/prompt-caching.md +221 -0
- package/docs/ko/guide/providers.md +232 -0
- package/docs/ko/guide/quality-control.md +137 -0
- package/docs/ko/guide/vitepress-integration.md +265 -0
- package/docs/ko/index.md +58 -0
- package/docs/zh/api/agent.md +262 -0
- package/docs/zh/api/engine.md +274 -0
- package/docs/zh/api/index.md +171 -0
- package/docs/zh/api/providers.md +304 -0
- package/docs/zh/changelog.md +64 -0
- package/docs/zh/cli/dir.md +243 -0
- package/docs/zh/cli/file.md +213 -0
- package/docs/zh/cli/glossary.md +273 -0
- package/docs/zh/cli/index.md +111 -0
- package/docs/zh/cli/init.md +158 -0
- package/docs/zh/guide/chunking.md +271 -0
- package/docs/zh/guide/configuration.md +139 -0
- package/docs/zh/guide/cost-optimization.md +30 -0
- package/docs/zh/guide/getting-started.md +150 -0
- package/docs/zh/guide/glossary.md +214 -0
- package/docs/zh/guide/index.md +32 -0
- package/docs/zh/guide/ollama.md +410 -0
- package/docs/zh/guide/prompt-caching.md +221 -0
- package/docs/zh/guide/providers.md +232 -0
- package/docs/zh/guide/quality-control.md +137 -0
- package/docs/zh/guide/vitepress-integration.md +265 -0
- package/docs/zh/index.md +58 -0
- package/package.json +91 -0
- package/release.config.mjs +15 -0
- package/schemas/glossary.schema.json +110 -0
- package/src/cli/commands/dir.ts +469 -0
- package/src/cli/commands/file.ts +291 -0
- package/src/cli/commands/glossary.ts +221 -0
- package/src/cli/commands/init.ts +68 -0
- package/src/cli/commands/serve.ts +60 -0
- package/src/cli/index.ts +64 -0
- package/src/cli/options.ts +59 -0
- package/src/core/agent.ts +1119 -0
- package/src/core/chunker.ts +391 -0
- package/src/core/engine.ts +634 -0
- package/src/errors.ts +188 -0
- package/src/index.ts +147 -0
- package/src/integrations/vitepress.ts +549 -0
- package/src/parsers/markdown.ts +383 -0
- package/src/providers/claude.ts +259 -0
- package/src/providers/interface.ts +109 -0
- package/src/providers/ollama.ts +379 -0
- package/src/providers/openai.ts +308 -0
- package/src/providers/registry.ts +153 -0
- package/src/server/index.ts +152 -0
- package/src/server/middleware/auth.ts +93 -0
- package/src/server/middleware/logger.ts +90 -0
- package/src/server/routes/health.ts +84 -0
- package/src/server/routes/translate.ts +210 -0
- package/src/server/types.ts +138 -0
- package/src/services/cache.ts +899 -0
- package/src/services/config.ts +217 -0
- package/src/services/glossary.ts +247 -0
- package/src/types/analysis.ts +164 -0
- package/src/types/index.ts +265 -0
- package/src/types/modes.ts +121 -0
- package/src/types/mqm.ts +157 -0
- package/src/utils/logger.ts +141 -0
- package/src/utils/tokens.ts +116 -0
- package/tests/fixtures/glossaries/ml-glossary.json +53 -0
- package/tests/fixtures/input/lynq-installation.ko.md +350 -0
- package/tests/fixtures/input/lynq-installation.md +350 -0
- package/tests/fixtures/input/simple.ko.md +27 -0
- package/tests/fixtures/input/simple.md +27 -0
- package/tests/unit/chunker.test.ts +229 -0
- package/tests/unit/glossary.test.ts +146 -0
- package/tests/unit/markdown.test.ts +205 -0
- package/tests/unit/tokens.test.ts +81 -0
- package/tsconfig.json +28 -0
- package/tsup.config.ts +34 -0
- package/vitest.config.ts +16 -0
package/RFC.md
ADDED
@@ -0,0 +1,1595 @@
# RFC: Consistent LLM-Powered CLI Translation Tool

**Project Name:** `llm-translate`
**Version:** 0.1.0
**Status:** Draft
**Author:** Tim Kang
**Date:** December 2025

---

## 1. Executive Summary

This document specifies the design and implementation requirements for a CLI-based document translation tool powered by Large Language Models. The tool ensures **translation consistency** through glossary enforcement, context-aware chunking, and iterative quality refinement. It supports multiple document formats (Markdown, HTML, plain text), multiple LLM providers (Claude, OpenAI, Ollama), and both single-file and batch directory processing.

### 1.1 Key Differentiators

- **Glossary-enforced consistency**: Domain-specific terminology is translated consistently across documents
- **Quality-aware refinement**: Iterative improvement loop until target quality threshold is met
- **Provider-agnostic**: Plugin architecture supports Claude, OpenAI, local LLMs, and custom providers
- **Structure preservation**: AST-based processing maintains document formatting integrity
- **Unix-friendly**: Supports stdin/stdout for pipeline integration

---

## 2. Goals and Non-Goals

### 2.1 Goals

| ID | Goal | Priority |
|----|------|----------|
| G1 | Translate documents while enforcing glossary terminology | P0 |
| G2 | Support Markdown, HTML, and plain text formats | P0 |
| G3 | Preserve document structure (headers, code blocks, tables, links) | P0 |
| G4 | Iteratively refine translations until quality threshold is met | P0 |
| G5 | Support multiple LLM providers via plugin architecture | P0 |
| G6 | Process single files via stdin/stdout | P0 |
| G7 | Batch process directories with configurable patterns | P1 |
| G8 | Cache translations to avoid redundant API calls | P1 |
| G9 | Expose functionality via MCP server for agent integration | P2 |
| G10 | Support Translation Memory (TMX) for leveraging past translations | P2 |

### 2.2 Non-Goals

- Real-time/streaming translation of live content
- GUI or web interface (CLI only)
- Translation of binary formats (PDF, DOCX); future consideration
- Built-in OCR or image text extraction
- Automated glossary generation from source documents

---

## 3. User Stories

### 3.1 Primary Use Cases

**UC1: Single File Translation (stdin/stdout)**
```bash
cat README.md | llm-translate -s en -t ko > README.ko.md
```

**UC2: Single File Translation (file paths)**
```bash
llm-translate file docs/guide.md -s en -t ko -o docs/ko/guide.md
```

**UC3: Directory Batch Translation**
```bash
llm-translate dir ./docs ./docs/ko -s en -t ko --glossary ./glossary.json
```

**UC4: Translation with Custom Quality Threshold**
```bash
llm-translate file spec.md -s en -t ja --quality 90 --max-iterations 5
```

**UC5: Using Different LLM Provider**
```bash
llm-translate file doc.md -s en -t ko --provider openai --model gpt-4o
```

**UC6: Initialize Project Configuration**
```bash
llm-translate init
# Creates .translaterc.json with default settings
```

---

## 4. Technical Architecture

### 4.1 High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                            CLI Layer                            │
│                         (Commander.js)                          │
├─────────────────────────────────────────────────────────────────┤
│                       Translation Engine                        │
│  ┌─────────────┐  ┌─────────────┐  ┌────────────────────────┐   │
│  │   Format    │  │  Semantic   │  │        Quality         │   │
│  │   Parsers   │  │   Chunker   │  │       Evaluator        │   │
│  └─────────────┘  └─────────────┘  └────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │           Translation Agent (Self-Refine Loop)           │   │
│  │     Initial → Reflect → Improve → Evaluate → Repeat      │   │
│  └──────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│                   Provider Abstraction Layer                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  Claude  │  │  OpenAI  │  │  Ollama  │  │  Custom  │         │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │
├─────────────────────────────────────────────────────────────────┤
│                       Supporting Services                       │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                 │
│  │  Glossary  │  │   Cache    │  │   Config   │                 │
│  │  Manager   │  │  Manager   │  │   Loader   │                 │
│  └────────────┘  └────────────┘  └────────────┘                 │
└─────────────────────────────────────────────────────────────────┘
```

### 4.2 Technology Stack

| Component | Technology | Rationale |
|-----------|------------|-----------|
| Runtime | Node.js 24+ | LTS, native ESM support |
| Language | TypeScript 5.x | Type safety, ecosystem maturity |
| CLI Framework | Commander.js | Industry standard, excellent DX |
| Markdown Parser | unified/remark | 150+ plugins, AST-based |
| HTML Parser | cheerio | jQuery-like API, fast |
| LLM SDK | Vercel AI SDK | Multi-provider, streaming support |
| Testing | Vitest | Fast, ESM-native |
| Build | tsup | Simple, fast bundling |
| Package Manager | yarn | Modern, workspaces support |

### 4.3 Project Structure

```
llm-translate/
├── src/
│   ├── cli/
│   │   ├── index.ts              # Entry point
│   │   ├── commands/
│   │   │   ├── file.ts           # Single file command
│   │   │   ├── dir.ts            # Directory batch command
│   │   │   ├── init.ts           # Project initialization
│   │   │   └── glossary.ts       # Glossary management
│   │   └── options.ts            # Shared CLI options
│   ├── core/
│   │   ├── engine.ts             # Main translation engine
│   │   ├── agent.ts              # Self-refine translation agent
│   │   ├── chunker.ts            # Semantic document chunker
│   │   └── evaluator.ts          # Quality evaluation
│   ├── parsers/
│   │   ├── markdown.ts           # Markdown AST processing
│   │   ├── html.ts               # HTML processing
│   │   └── plaintext.ts          # Plain text processing
│   ├── providers/
│   │   ├── interface.ts          # Provider interface definition
│   │   ├── registry.ts           # Provider registry
│   │   ├── claude.ts             # Claude adapter
│   │   ├── openai.ts             # OpenAI adapter
│   │   └── ollama.ts             # Ollama adapter
│   ├── services/
│   │   ├── glossary.ts           # Glossary management
│   │   ├── cache.ts              # Translation cache
│   │   └── config.ts             # Configuration loader
│   ├── types/
│   │   └── index.ts              # Shared type definitions
│   └── utils/
│       ├── tokens.ts             # Token counting utilities
│       └── logger.ts             # Logging utilities
├── tests/
├── .translaterc.json             # Example config
├── package.json
├── tsconfig.json
└── README.md
```

---

## 5. Core Interfaces and Types

### 5.1 Configuration Types

```typescript
// src/types/config.ts

export interface TranslateConfig {
  version: string;
  project?: {
    name: string;
    description: string;
    purpose: string; // Used for context if not specified per-request
  };
  languages: {
    source: string; // ISO 639-1 code
    targets: string[];
    styles?: Record<string, string>; // Per-language style instructions (e.g., { "ko": "경어체", "ja": "です・ます調" })
  };
  provider: {
    default: ProviderName;
    model?: string;
    fallback?: ProviderName[];
    // Partial: a key is only needed for the providers actually in use
    apiKeys?: Partial<Record<ProviderName, string>>; // Optional, prefer env vars
  };
  quality: {
    threshold: number; // 0-100, default: 85
    maxIterations: number; // default: 4
    evaluationMethod: 'llm' | 'embedding' | 'hybrid';
  };
  chunking: {
    maxTokens: number; // default: 1024
    overlapTokens: number; // default: 150
    preserveStructure: boolean;
  };
  glossary?: {
    path: string;
    strict: boolean; // Fail if glossary term not applied
  };
  paths: {
    output: string; // Supports {lang} placeholder
    cache?: string;
  };
  ignore?: string[]; // Glob patterns to ignore
}

export type ProviderName = 'claude' | 'openai' | 'ollama' | 'custom';
```
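
As a rough illustration of the `{lang}` placeholder in `paths.output`, the sketch below expands an output template once per target language. The helper name `resolveOutputPath` and the sample values are assumptions made for this example; the RFC does not define how the placeholder is expanded.

```typescript
// Hypothetical sample values; only the field names come from TranslateConfig.
const sampleConfig = {
  languages: { source: 'en', targets: ['ko', 'ja'] },
  paths: { output: './docs/{lang}' },
};

// Hypothetical helper: replaces every {lang} occurrence in the template.
// split/join avoids having to escape the braces in a regular expression.
function resolveOutputPath(template: string, lang: string): string {
  return template.split('{lang}').join(lang);
}

const outputDirs = sampleConfig.languages.targets.map((lang) =>
  resolveOutputPath(sampleConfig.paths.output, lang),
);

console.log(outputDirs);
```

A template without the placeholder is returned unchanged, so all targets would then share one output directory.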

### 5.2 Glossary Types

```typescript
// src/types/glossary.ts

export interface Glossary {
  metadata: {
    name: string;
    sourceLang: string;
    targetLangs: string[]; // Multiple target languages
    version: string;
    domain?: string;
  };
  terms: GlossaryTerm[];
}

export interface GlossaryTerm {
  source: string;
  targets: Record<string, string>; // { "ko": "파드", "ja": "ポッド", "zh-CN": "Pod" }
  context?: string; // Usage context hint
  caseSensitive?: boolean; // default: false
  doNotTranslate?: boolean; // Keep source as-is for ALL languages
  doNotTranslateFor?: string[]; // Keep source as-is for SPECIFIC languages
  partOfSpeech?: 'noun' | 'verb' | 'adjective' | 'other';
  notes?: string;
}

// Runtime resolved glossary for a specific target language
export interface ResolvedGlossary {
  metadata: {
    name: string;
    sourceLang: string;
    targetLang: string; // Single resolved target
    version: string;
    domain?: string;
  };
  terms: ResolvedGlossaryTerm[];
}

export interface ResolvedGlossaryTerm {
  source: string;
  target: string; // Resolved for specific language
  context?: string;
  caseSensitive: boolean;
  doNotTranslate: boolean;
}
```
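
One way the multi-language `GlossaryTerm` entries might be narrowed to `ResolvedGlossaryTerm` entries for a single target language is sketched below. The function `resolveTerms` is hypothetical, and the choice to skip terms that have no entry for the requested language is an assumption; the RFC does not say how missing entries are handled.

```typescript
// Trimmed copies of the glossary types, repeated so the sketch is self-contained.
interface GlossaryTerm {
  source: string;
  targets: Record<string, string>;
  context?: string;
  caseSensitive?: boolean;
  doNotTranslate?: boolean;
  doNotTranslateFor?: string[];
}

interface ResolvedGlossaryTerm {
  source: string;
  target: string;
  context?: string;
  caseSensitive: boolean;
  doNotTranslate: boolean;
}

// Hypothetical resolver: picks the target for one language and applies defaults.
function resolveTerms(terms: GlossaryTerm[], targetLang: string): ResolvedGlossaryTerm[] {
  const resolved: ResolvedGlossaryTerm[] = [];
  for (const term of terms) {
    // A term keeps its source form if flagged globally or for this language.
    const keepSource =
      term.doNotTranslate === true || (term.doNotTranslateFor ?? []).includes(targetLang);
    const target = keepSource ? term.source : term.targets[targetLang];
    if (target === undefined) continue; // Assumption: no entry for this language, so skip
    resolved.push({
      source: term.source,
      target,
      context: term.context,
      caseSensitive: term.caseSensitive ?? false, // documented default
      doNotTranslate: keepSource,
    });
  }
  return resolved;
}

const sampleTerms: GlossaryTerm[] = [
  { source: 'Pod', targets: { ko: '파드', ja: 'ポッド' } },
  { source: 'Kubernetes', targets: {}, doNotTranslate: true },
  { source: 'Ingress', targets: { ja: 'イングレス' } },
];

const koTerms = resolveTerms(sampleTerms, 'ko');
console.log(koTerms.map((t) => `${t.source} -> ${t.target}`));
```

Here `Ingress` has no Korean entry and is dropped, while `Kubernetes` is kept verbatim because of `doNotTranslate`.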

### 5.3 Provider Interface

```typescript
// src/providers/interface.ts

export interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

export interface ChatRequest {
  messages: ChatMessage[];
  model?: string;
  temperature?: number;
  maxTokens?: number;
}

export interface ChatResponse {
  content: string;
  usage: {
    inputTokens: number;
    outputTokens: number;
  };
  model: string;
  finishReason: 'stop' | 'length' | 'error';
}

export interface ModelInfo {
  maxContextTokens: number;
  supportsStreaming: boolean;
  costPer1kInput?: number;
  costPer1kOutput?: number;
}

export interface LLMProvider {
  readonly name: ProviderName;

  chat(request: ChatRequest): Promise<ChatResponse>;
  stream(request: ChatRequest): AsyncIterable<string>;
  countTokens(text: string): number;
  getModelInfo(model?: string): ModelInfo;
}

export interface ProviderConfig {
  apiKey?: string;
  baseUrl?: string;
  defaultModel?: string;
  timeout?: number;
}
```
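
To show what satisfying `LLMProvider` involves, here is a minimal sketch of an offline stand-in provider that could serve as a test double. `EchoProvider`, the `echo-1` model name, the 8192-token context size, and the four-characters-per-token estimate are all assumptions for illustration; none of them come from the RFC or from a real provider SDK.

```typescript
// Trimmed copies of the provider types, repeated so the sketch is self-contained.
type ProviderName = 'claude' | 'openai' | 'ollama' | 'custom';
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string; }
interface ChatRequest { messages: ChatMessage[]; model?: string; }
interface ChatResponse {
  content: string;
  usage: { inputTokens: number; outputTokens: number };
  model: string;
  finishReason: 'stop' | 'length' | 'error';
}
interface ModelInfo { maxContextTokens: number; supportsStreaming: boolean; }
interface LLMProvider {
  readonly name: ProviderName;
  chat(request: ChatRequest): Promise<ChatResponse>;
  stream(request: ChatRequest): AsyncIterable<string>;
  countTokens(text: string): number;
  getModelInfo(model?: string): ModelInfo;
}

// Hypothetical provider that echoes the last user message back.
class EchoProvider implements LLMProvider {
  readonly name: ProviderName = 'custom';

  async chat(request: ChatRequest): Promise<ChatResponse> {
    const content = this.lastUserMessage(request);
    const inputTokens = request.messages.reduce(
      (sum, message) => sum + this.countTokens(message.content),
      0,
    );
    return {
      content,
      usage: { inputTokens, outputTokens: this.countTokens(content) },
      model: request.model ?? 'echo-1',
      finishReason: 'stop',
    };
  }

  // Streams the same reply one whitespace-separated word at a time.
  async *stream(request: ChatRequest): AsyncGenerator<string> {
    for (const word of this.lastUserMessage(request).split(' ')) {
      yield word;
    }
  }

  // Assumption: roughly four characters per token; real tokenizers differ.
  countTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  getModelInfo(): ModelInfo {
    return { maxContextTokens: 8192, supportsStreaming: true };
  }

  private lastUserMessage(request: ChatRequest): string {
    const userMessages = request.messages.filter((m) => m.role === 'user');
    return userMessages.length > 0 ? userMessages[userMessages.length - 1].content : '';
  }
}

const echoProvider = new EchoProvider();
echoProvider
  .chat({ messages: [{ role: 'user', content: 'hello world' }] })
  .then((response) => console.log(response.content, response.usage));
```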

### 5.4 Translation Types

```typescript
// src/types/translation.ts

export interface TranslationRequest {
  content: string;
  sourceLang: string;
  targetLang: string;
  format: 'markdown' | 'html' | 'text';
  glossary?: Glossary;
  context?: {
    documentPurpose?: string;
    previousChunks?: string[];
    documentSummary?: string;
  };
  options?: {
    qualityThreshold?: number;
    maxIterations?: number;
    preserveFormatting?: boolean;
  };
}

export interface TranslationResult {
  content: string;
  metadata: {
    qualityScore: number;
    iterations: number;
    tokensUsed: {
      input: number;
      output: number;
    };
    duration: number;
    provider: string;
    model: string;
  };
  glossaryCompliance?: {
    applied: string[];
    missed: string[];
  };
}

export interface ChunkResult {
  original: string;
  translated: string;
  startOffset: number;
  endOffset: number;
  qualityScore: number;
}

export interface DocumentResult {
  content: string;
  chunks: ChunkResult[];
  metadata: {
    totalTokensUsed: number;
    totalDuration: number;
    averageQuality: number;
  };
}
```

### 5.5 Chunking Types

```typescript
// src/types/chunking.ts

export interface Chunk {
  id: string;
  content: string;
  type: 'translatable' | 'preserve'; // preserve = code blocks, etc.
  startOffset: number;
  endOffset: number;
  metadata?: {
    headerHierarchy?: string[]; // ["# Title", "## Section"]
    previousContext?: string;
  };
}

export interface ChunkingConfig {
  maxTokens: number;
  overlapTokens: number;
  separators: string[];
  preservePatterns: RegExp[]; // Patterns to keep intact (code blocks)
}
```

### 5.6 MQM Quality Evaluation Types

Based on [MQM (Multidimensional Quality Metrics)](https://themqm.org/) framework.

```typescript
// src/types/mqm.ts

/**
 * MQM Error Categories
 * Based on MQM framework used in WMT evaluations
 */
export type MQMErrorType =
  // Accuracy errors
  | 'accuracy/mistranslation'
  | 'accuracy/omission'
  | 'accuracy/addition'
  | 'accuracy/untranslated'
  // Fluency errors
  | 'fluency/grammar'
  | 'fluency/spelling'
  | 'fluency/register'
  | 'fluency/inconsistency'
  // Style errors
  | 'style/awkward'
  | 'style/unidiomatic';

/**
 * MQM Severity Levels with scoring weights
 */
export type MQMSeverity = 'minor' | 'major' | 'critical';

export const MQM_SEVERITY_WEIGHTS: Record<MQMSeverity, number> = {
  minor: 1,
  major: 5,
  critical: 25,
};

/**
 * Individual MQM error annotation
 */
export interface MQMError {
  type: MQMErrorType;
  severity: MQMSeverity;
  span: string; // The affected text in translation
  suggestion: string; // Suggested correction
  explanation?: string; // Brief reason for the error
  sourceSpan?: string; // Corresponding source text (if applicable)
}

/**
 * MQM evaluation result
 */
export interface MQMEvaluation {
  errors: MQMError[];
  score: number; // 100 - sum(error weights), min 0
  summary: string; // Brief overall assessment
  breakdown: {
    accuracy: number; // Count of accuracy errors
    fluency: number; // Count of fluency errors
    style: number; // Count of style errors
  };
}

/**
 * Calculate MQM score from errors
 */
export function calculateMQMScore(errors: MQMError[]): number {
  const totalPenalty = errors.reduce(
    (sum, err) => sum + MQM_SEVERITY_WEIGHTS[err.severity],
    0
  );
  return Math.max(0, 100 - totalPenalty);
}
```
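
A short worked example of the scoring rule may help. The sketch below repeats the weights and the scoring function with a simplified error shape (only `severity` is kept), then scores three hypothetical error lists. With weights 1, 5, and 25, two minor errors plus one major error cost 2 × 1 + 5 = 7 points, and the score is clamped at 0 once penalties exceed 100.

```typescript
type MQMSeverity = 'minor' | 'major' | 'critical';

const MQM_SEVERITY_WEIGHTS: Record<MQMSeverity, number> = {
  minor: 1,
  major: 5,
  critical: 25,
};

// Simplified error shape for this example; the full MQMError has more fields.
interface ScoredError {
  severity: MQMSeverity;
}

function calculateMQMScore(errors: ScoredError[]): number {
  const totalPenalty = errors.reduce(
    (sum, err) => sum + MQM_SEVERITY_WEIGHTS[err.severity],
    0,
  );
  return Math.max(0, 100 - totalPenalty);
}

// Builds n errors of one severity (hypothetical convenience for the example).
const repeat = (severity: MQMSeverity, n: number): ScoredError[] =>
  Array.from({ length: n }, () => ({ severity }));

const lightScore = calculateMQMScore([...repeat('minor', 2), ...repeat('major', 1)]); // 100 - 7
const heavyScore = calculateMQMScore([...repeat('critical', 1), ...repeat('major', 4)]); // 100 - 45
const floorScore = calculateMQMScore(repeat('critical', 5)); // 100 - 125, clamped to 0

console.log(lightScore, heavyScore, floorScore); // 93 55 0
```

Under this rule a single critical error (25 points) already pushes a translation below the 85 threshold used by the quality preset.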

### 5.7 Pre-Translation Analysis Types

Based on [MAPS (Multi-Aspect Prompting and Selection)](https://github.com/zwhe99/MAPS-mt) framework.

```typescript
// src/types/analysis.ts

/**
 * Key term identified during pre-analysis
 */
export interface AnalyzedTerm {
  term: string;
  context: string;
  suggestedTranslation?: string;
  fromGlossary: boolean;
}

/**
 * Ambiguous phrase that needs clarification
 */
export interface AmbiguousPhrase {
  phrase: string;
  interpretations: string[];
  recommendation: string;
}

/**
 * Domain classification for the content
 */
export type ContentDomain =
  | 'technical'
  | 'marketing'
  | 'legal'
  | 'medical'
  | 'general';

/**
 * Register/formality recommendation
 */
export type RegisterLevel = 'formal' | 'informal' | 'neutral';

/**
 * Pre-translation analysis result (MAPS-style)
 */
export interface PreTranslationAnalysis {
  /** Key domain-specific terms identified */
  keyTerms: AnalyzedTerm[];

  /** Phrases with multiple possible interpretations */
  ambiguousPhrases: AmbiguousPhrase[];

  /** Items that should NOT be translated (code, URLs, names) */
  preserveExact: string[];

  /** Identified translation challenges for this language pair */
  challenges: string[];

  /** Detected content domain */
  domain: ContentDomain;

  /** Recommended formality level */
  registerRecommendation: RegisterLevel;
}

/**
 * Translation agent configuration with analysis options
 */
export interface TranslationAgentConfig {
  provider: LLMProvider;
  qualityThreshold?: number;
  maxIterations?: number;

  /** Enable pre-translation analysis (MAPS) - default: true */
  enableAnalysis?: boolean;

  /** Use MQM-based evaluation - default: true */
  useMQMEvaluation?: boolean;

  /** Translation mode affecting pipeline */
  mode?: 'fast' | 'balanced' | 'quality';
}
```

### 5.8 Translation Mode Configurations

```typescript
// src/types/modes.ts

/**
 * Translation mode presets
 */
export type TranslationMode = 'fast' | 'balanced' | 'quality';

export interface ModeConfig {
  enableAnalysis: boolean;
  useMQMEvaluation: boolean;
  maxIterations: number;
  qualityThreshold: number;
}

export const MODE_PRESETS: Record<TranslationMode, ModeConfig> = {
  /** Fast mode: Single pass, no evaluation */
  fast: {
    enableAnalysis: false,
    useMQMEvaluation: false,
    maxIterations: 1,
    qualityThreshold: 0, // Skip threshold check
  },

  /** Balanced mode: TEaR with simplified evaluation */
  balanced: {
    enableAnalysis: false,
    useMQMEvaluation: true,
    maxIterations: 2,
    qualityThreshold: 75,
  },

  /** Quality mode: Full MAPS + TEaR pipeline */
  quality: {
    enableAnalysis: true,
    useMQMEvaluation: true,
    maxIterations: 4,
    qualityThreshold: 85,
  },
};
```
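
The presets are meant to be overridable by explicit settings such as a quality threshold or iteration limit. A minimal sketch of one possible merge is shown below; `resolveModeConfig` is a hypothetical helper, and the rule that only explicitly provided values replace preset values is an assumption about the intended precedence.

```typescript
type TranslationMode = 'fast' | 'balanced' | 'quality';

interface ModeConfig {
  enableAnalysis: boolean;
  useMQMEvaluation: boolean;
  maxIterations: number;
  qualityThreshold: number;
}

// Preset values copied from the block above.
const MODE_PRESETS: Record<TranslationMode, ModeConfig> = {
  fast: { enableAnalysis: false, useMQMEvaluation: false, maxIterations: 1, qualityThreshold: 0 },
  balanced: { enableAnalysis: false, useMQMEvaluation: true, maxIterations: 2, qualityThreshold: 75 },
  quality: { enableAnalysis: true, useMQMEvaluation: true, maxIterations: 4, qualityThreshold: 85 },
};

// Hypothetical merge: start from the preset, then apply only values that were set.
function resolveModeConfig(
  mode: TranslationMode,
  overrides: Partial<ModeConfig> = {},
): ModeConfig {
  const resolved: ModeConfig = { ...MODE_PRESETS[mode] };
  if (overrides.enableAnalysis !== undefined) resolved.enableAnalysis = overrides.enableAnalysis;
  if (overrides.useMQMEvaluation !== undefined) resolved.useMQMEvaluation = overrides.useMQMEvaluation;
  if (overrides.maxIterations !== undefined) resolved.maxIterations = overrides.maxIterations;
  if (overrides.qualityThreshold !== undefined) resolved.qualityThreshold = overrides.qualityThreshold;
  return resolved;
}

// Balanced preset with a stricter threshold; iteration count stays at the preset value.
const strictBalanced = resolveModeConfig('balanced', { qualityThreshold: 90 });
console.log(strictBalanced);
```

Checking each field for `undefined`, rather than spreading the overrides object, keeps an unset option from erasing a preset value.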

---

## 6. CLI Specification

### 6.1 Command Structure

```
llm-translate <command> [options]

Commands:
  file <input> [output]     Translate a single file
  dir <input> <output>      Translate all files in directory
  init                      Initialize project configuration
  glossary <subcommand>     Manage glossary (add, remove, list, validate)

Global Options:
  -s, --source-lang <lang>  Source language code (required)
  -t, --target-lang <lang>  Target language code (required)
  -c, --config <path>       Path to config file (default: .translaterc.json)
  -v, --verbose             Enable verbose logging
  -q, --quiet               Suppress non-error output
  --version                 Show version number
  --help                    Show help

Translation Options:
  -g, --glossary <path>     Path to glossary file
  -p, --provider <name>     LLM provider (claude|openai|ollama)
  -m, --model <name>        Model name
  --mode <mode>             Translation mode: fast|balanced|quality (default: balanced)
  --quality <0-100>         Quality threshold (default: 85, overrides mode)
  --max-iterations <n>      Max refinement iterations (default: 4, overrides mode)
  --no-analysis             Disable pre-translation analysis (MAPS)
  --no-mqm                  Use simple evaluation instead of MQM

Output Options:
  -o, --output <path>       Output path (file or directory)
  -f, --format <fmt>        Force output format (md|html|txt)
  --dry-run                 Show what would be translated
  --json                    Output results as JSON

Advanced Options:
  --chunk-size <tokens>     Max tokens per chunk (default: 1024)
  --parallel <n>            Parallel file processing (default: 3)
  --no-cache                Disable translation cache
  --context <text>          Additional context for translation
```

### 6.2 Usage Examples

```bash
# Basic single file translation
llm-translate file README.md -s en -t ko -o README.ko.md

# Stdin/stdout pipeline
cat doc.md | llm-translate -s en -t ja > doc.ja.md

# Batch directory with glossary
llm-translate dir ./docs ./docs/ko \
  -s en -t ko \
  --glossary ./k8s-glossary.json \
  --quality 90

# Using OpenAI instead of Claude
llm-translate file guide.md -s en -t de \
  --provider openai \
  --model gpt-4o

# Dry run to preview
llm-translate dir ./src ./src/i18n -s en -t ja --dry-run

# Initialize new project
llm-translate init

# Validate glossary
llm-translate glossary validate ./glossary.json
```

### 6.3 Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | File not found |
| 4 | Translation quality threshold not met |
| 5 | Provider/API error |
| 6 | Glossary validation failed |
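
A CLI entry point could translate internal failures into these codes with a small lookup. The sketch below is one possible mapping, not the package's confirmed behavior: the `exitCodeFor` helper and the way error names (borrowed from the `ErrorCode` enum in section 11.1) are grouped under each exit code are assumptions made for illustration.

```typescript
// Exit codes from the table above.
const EXIT_CODES = {
  success: 0,
  generalError: 1,
  invalidArguments: 2,
  fileNotFound: 3,
  qualityThresholdNotMet: 4,
  providerError: 5,
  glossaryValidationFailed: 6,
} as const;

// Hypothetical helper: maps an error code name to a process exit code.
// The grouping below is an assumption; the RFC does not define it.
function exitCodeFor(errorCode: string): number {
  // All provider failures (auth, rate limit, ...) share one exit code.
  if (errorCode.startsWith("PROVIDER_")) return EXIT_CODES.providerError;
  switch (errorCode) {
    case "FILE_NOT_FOUND":
      return EXIT_CODES.fileNotFound;
    case "QUALITY_THRESHOLD_NOT_MET":
      return EXIT_CODES.qualityThresholdNotMet;
    case "GLOSSARY_INVALID":
      return EXIT_CODES.glossaryValidationFailed;
    default:
      // Anything unrecognized falls back to the general error code.
      return EXIT_CODES.generalError;
  }
}
```

A top-level error handler could then print the message and call `process.exit(exitCodeFor(error.code))`, which lets shell scripts branch on the result.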

---

## 7. Core Algorithm: Self-Refine Translation

### 7.1 Translation Agent Flow

The translation pipeline follows a multi-step approach inspired by:
- [MAPS (Multi-Aspect Prompting and Selection)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00642/119992) - TACL 2024
- [TEaR (Translate, Estimate, Refine)](https://arxiv.org/abs/2402.16379) - NAACL 2025
- [Translating Step-by-Step](https://arxiv.org/abs/2409.06790) - Google, WMT 2024

```
┌─────────────────────────────────────────────────────────────┐
│                      TRANSLATION AGENT                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐                                           │
│  │ 1. PREPARE   │  Load glossary, build context             │
│  └──────┬───────┘                                           │
│         ▼                                                   │
│  ┌──────────────┐                                           │
│  │ 2. ANALYZE   │  Pre-translation analysis (MAPS)          │
│  │  (Optional)  │  Extract keywords, identify challenges    │
│  └──────┬───────┘                                           │
│         ▼                                                   │
│  ┌──────────────┐                                           │
│  │ 3. INITIAL   │  Generate first translation               │
│  │    TRANSLATE │  with glossary + context + analysis       │
│  └──────┬───────┘                                           │
│         ▼                                                   │
│  ┌──────────────┐     ┌─────────────────────────────┐       │
│  │ 4. EVALUATE  │────▶│ Quality >= Threshold?       │       │
│  │    (MQM)     │     │ OR Max iterations reached?  │       │
│  └──────┬───────┘     └─────────────┬───────────────┘       │
│         │                           │                       │
│         │ No                        │ Yes                   │
│         ▼                           ▼                       │
│  ┌──────────────┐            ┌──────────────┐               │
│  │ 5. REFINE    │            │ 7. RETURN    │               │
│  │   Apply MQM  │            │    Final     │               │
│  │   error fixes│            │    Result    │               │
│  └──────┬───────┘            └──────────────┘               │
│         │                                                   │
│         └──────────────────────────────────┐                │
│              Loop back to EVALUATE         │                │
└────────────────────────────────────────────┴────────────────┘
```
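
The evaluate/refine loop in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: the `RefineSteps` interface, the `selfRefine` function, and the choice to count one iteration per refinement pass are all hypothetical. The defaults mirror the CLI defaults for `--quality` (85) and `--max-iterations` (4).

```typescript
// Result of one evaluation pass (for example, MQM-based scoring).
interface Evaluation {
  score: number;
  feedback: string[];
}

// Hypothetical step functions; in practice each would call an LLM provider.
interface RefineSteps {
  translate(source: string): Promise<string>;
  evaluate(source: string, translation: string): Promise<Evaluation>;
  refine(source: string, translation: string, evaluation: Evaluation): Promise<string>;
}

interface RefineResult {
  translation: string;
  score: number;
  iterations: number; // number of refinement passes performed (assumption)
  thresholdMet: boolean;
}

async function selfRefine(
  source: string,
  steps: RefineSteps,
  threshold = 85,
  maxIterations = 4
): Promise<RefineResult> {
  // Step 3: initial translation, then step 4: first evaluation.
  let translation = await steps.translate(source);
  let evaluation = await steps.evaluate(source, translation);
  let iterations = 0;

  // Step 5: refine and re-evaluate until the score is good enough
  // or the iteration budget is used up.
  while (evaluation.score < threshold && iterations < maxIterations) {
    translation = await steps.refine(source, translation, evaluation);
    evaluation = await steps.evaluate(source, translation);
    iterations += 1;
  }

  return {
    translation,
    score: evaluation.score,
    iterations,
    thresholdMet: evaluation.score >= threshold,
  };
}
```

Passing the steps in as an interface keeps the loop independent of any provider and makes it easy to exercise with fakes in unit tests.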

### 7.1.1 Pre-Translation Analysis (MAPS-Style)

Before translation, analyze the source text to identify potential challenges.
This step is **optional** and can be skipped in `fast` mode.

**Analysis Prompt:**
```
Analyze this {sourceLang} text before translating to {targetLang}.

## Source Text:
{sourceText}

## Glossary Terms Available:
{glossaryTerms}

## Analyze and extract:
1. **Key Terms**: Important domain-specific terms that need careful translation
2. **Ambiguous Phrases**: Phrases with multiple possible interpretations
3. **Cultural References**: Content that may need localization
4. **Technical Identifiers**: Code, URLs, names that should NOT be translated
5. **Potential Challenges**: Specific difficulties for this language pair

Respond with only a JSON object:
{
  "keyTerms": [
    {"term": "...", "context": "...", "suggestedTranslation": "..."}
  ],
  "ambiguousPhrases": [
    {"phrase": "...", "interpretations": ["...", "..."], "recommendation": "..."}
  ],
  "preserveExact": ["code", "urls", "names to keep unchanged"],
  "challenges": ["challenge 1", "challenge 2"],
  "domain": "technical|marketing|legal|general",
  "registerRecommendation": "formal|informal|neutral"
}
```

**Benefits of Pre-Analysis:**
- Resolves ambiguity before translation
- Reduces hallucination
- Improves consistency for domain-specific content
- Enables better glossary application
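
Because the analysis prompt asks for a fixed JSON shape, the reply can be given a type and parsed defensively before it feeds the initial translation. The sketch below is illustrative only: `PreTranslationAnalysis`, `parseAnalysis`, and the fallback values `"general"` and `"neutral"` are assumptions, and the helper does not validate the shape of individual array elements.

```typescript
interface KeyTerm {
  term: string;
  context: string;
  suggestedTranslation: string;
}

interface AmbiguousPhrase {
  phrase: string;
  interpretations: string[];
  recommendation: string;
}

// Shape of the JSON object requested by the analysis prompt.
interface PreTranslationAnalysis {
  keyTerms: KeyTerm[];
  ambiguousPhrases: AmbiguousPhrase[];
  preserveExact: string[];
  challenges: string[];
  domain: string;
  registerRecommendation: string;
}

// Returns the value as an array, or an empty array if it is not one.
function asArray<T>(value: unknown): T[] {
  return Array.isArray(value) ? (value as T[]) : [];
}

// Returns the value if it is a string, otherwise the fallback.
function asString(value: unknown, fallback: string): string {
  return typeof value === "string" ? value : fallback;
}

// Hypothetical helper: parses the model's reply and fills in defaults
// for missing fields so later steps can rely on the overall shape.
function parseAnalysis(raw: string): PreTranslationAnalysis {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Analysis response is not a JSON object");
  }
  const data = parsed as Record<string, unknown>;
  return {
    keyTerms: asArray<KeyTerm>(data["keyTerms"]),
    ambiguousPhrases: asArray<AmbiguousPhrase>(data["ambiguousPhrases"]),
    preserveExact: asArray<string>(data["preserveExact"]),
    challenges: asArray<string>(data["challenges"]),
    domain: asString(data["domain"], "general"),
    registerRecommendation: asString(data["registerRecommendation"], "neutral"),
  };
}
```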

### 7.2 Prompt Templates

**Initial Translation Prompt:**
```
You are a professional translator. Translate the following {sourceLang} text to {targetLang}.

## Glossary (MUST use these exact translations):
{glossaryTerms}

## Document Context:
Purpose: {documentPurpose}
Style: {styleInstruction}
Previous content: {previousContext}

## Rules:
1. Apply glossary terms exactly as specified
2. Preserve all formatting (markdown, HTML tags, code blocks)
3. Maintain the same tone and style
4. Do not translate content inside code blocks
5. Keep URLs, file paths, and technical identifiers unchanged

## Source Text:
{sourceText}

## Translation:
```

**Reflection Prompt:**
```
Review this translation and provide specific improvement suggestions.

## Source ({sourceLang}):
{sourceText}

## Translation ({targetLang}):
{translatedText}

## Glossary Requirements:
{glossaryTerms}

## Evaluate and suggest improvements for:
1. **Accuracy**: Does the translation convey the exact meaning?
2. **Glossary Compliance**: Are all glossary terms applied correctly?
3. **Fluency**: Does it read naturally in {targetLang}?
4. **Formatting**: Is the structure preserved?
5. **Consistency**: Are terms translated consistently?

Provide a numbered list of specific, actionable suggestions:
```

**Improvement Prompt:**
```
Improve this translation based on the following suggestions.

## Source ({sourceLang}):
{sourceText}

## Current Translation:
{currentTranslation}

## Improvement Suggestions:
{suggestions}

## Glossary (MUST apply):
{glossaryTerms}

Provide only the improved translation, nothing else:
```

### 7.3 Quality Evaluation (MQM-Based)

Quality evaluation uses the **Multidimensional Quality Metrics (MQM)** framework for structured error annotation.
This approach is based on the [TEaR (Translate, Estimate, Refine)](https://arxiv.org/abs/2402.16379) research paper (NAACL 2025).

**MQM Error Types:**
```
Accuracy
├── Mistranslation    # Incorrect meaning
├── Omission          # Missing content
├── Addition          # Extra content not in source
└── Untranslated      # Source text left unchanged

Fluency
├── Grammar           # Grammatical errors
├── Spelling          # Spelling/typos
├── Register          # Inappropriate formality level
└── Inconsistency     # Inconsistent terminology

Style
├── Awkward           # Unnatural phrasing
└── Unidiomatic       # Non-native expressions
```

**MQM Severity Weights:**

| Severity | Weight | Description |
|----------|--------|-------------|
| Minor | 1 | Noticeable but doesn't affect understanding |
| Major | 5 | Affects understanding or usability |
| Critical | 25 | Completely wrong or unusable |

**Score Calculation:**
```
score = max(0, 100 - Σ(error_weight))
```
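
As a worked example of the formula, the sketch below sums the severity weights from the table and clamps the result at zero. The names `SEVERITY_WEIGHTS` and `mqmScore` are hypothetical; only the weights and the formula come from this section.

```typescript
type Severity = "minor" | "major" | "critical";

// Weights from the MQM severity table.
const SEVERITY_WEIGHTS: Record<Severity, number> = {
  minor: 1,
  major: 5,
  critical: 25,
};

// Computes score = max(0, 100 - sum of error weights).
function mqmScore(errors: ReadonlyArray<{ severity: Severity }>): number {
  const penalty = errors.reduce(
    (sum, error) => sum + SEVERITY_WEIGHTS[error.severity],
    0
  );
  return Math.max(0, 100 - penalty);
}

// One major and two minor errors: 100 - (5 + 1 + 1) = 93.
const exampleScore = mqmScore([
  { severity: "major" },
  { severity: "minor" },
  { severity: "minor" },
]);
```

With the default threshold of 85, a single critical error (weight 25) is enough to push a translation below the bar, while up to 15 minor errors are tolerated.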

**MQM Evaluation Prompt:**
```
Evaluate this translation using MQM (Multidimensional Quality Metrics) framework.

## Source ({sourceLang}):
{sourceText}

## Translation ({targetLang}):
{translatedText}

## Glossary Terms (must be applied exactly):
{glossaryTerms}

## Instructions:
1. Identify all translation errors
2. Classify each error by type and severity
3. For each error, provide the span and suggested fix
4. Severity: "minor" (1 point), "major" (5 points), "critical" (25 points)

Respond with only a JSON object:
{
  "errors": [
    {
      "type": "accuracy/mistranslation",
      "severity": "major",
      "span": "affected text",
      "suggestion": "corrected text",
      "explanation": "brief reason"
    }
  ],
  "score": <100 - sum of weights>,
  "summary": "brief overall assessment"
}
```

**Legacy Simple Evaluation (fallback for fast mode):**
```
Rate this translation's quality from 0 to 100.

## Source ({sourceLang}):
{sourceText}

## Translation ({targetLang}):
{translatedText}

## Evaluation Criteria:
- Semantic accuracy (40 points)
- Fluency and naturalness (25 points)
- Glossary compliance (20 points)
- Format preservation (15 points)

Respond with only a JSON object:
{"score": <number>, "breakdown": {"accuracy": <n>, "fluency": <n>, "glossary": <n>, "format": <n>}, "issues": ["issue1", "issue2"]}
```

**Prompt Variables:**

| Variable | Source | Description |
|----------|--------|-------------|
| `{sourceLang}` | `languages.source` | Source language code |
| `{targetLang}` | CLI argument or iteration | Target language code |
| `{glossaryTerms}` | Resolved glossary | Formatted glossary terms for target language |
| `{documentPurpose}` | `project.purpose` or `--context` | Document context description |
| `{styleInstruction}` | `languages.styles[targetLang]` | Per-language style instruction (e.g., "경어체", "です・ます調"). Empty if not specified. |
| `{previousContext}` | Previous chunks | Context from previously translated chunks |
| `{sourceText}` | Input content | Text to translate |
| `{translatedText}` | Translation result | Current translation (for reflection/evaluation) |
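
Filling these variables into a template is a plain string substitution. The following is a minimal sketch, assuming a hypothetical `renderPrompt` helper; it only replaces placeholders made of word characters, so the literal JSON braces in the prompt templates are left alone, and unknown placeholders are kept as written.

```typescript
// Hypothetical helper: replaces {name} placeholders with values from `variables`.
function renderPrompt(
  template: string,
  variables: Record<string, string>
): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string): string => {
    const value = variables[name];
    // Leave placeholders without a supplied value untouched.
    return value === undefined ? match : value;
  });
}

const rendered = renderPrompt(
  "Translate the following {sourceLang} text to {targetLang}.\nStyle: {styleInstruction}",
  { sourceLang: "en", targetLang: "ko", styleInstruction: "" }
);
```

Passing an empty string for `styleInstruction` matches the table's note that the variable is empty when no style is configured.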

---

## 8. File Format Specifications

### 8.1 Configuration File (.translaterc.json)

```json
{
  "version": "1.0",
  "project": {
    "name": "Kubernetes Documentation",
    "description": "Official Kubernetes documentation translation",
    "purpose": "Technical documentation for DevOps engineers and developers"
  },
  "languages": {
    "source": "en",
    "targets": ["ko", "ja", "zh-CN"],
    "styles": {
      "ko": "경어체(존댓말)를 사용하세요",
      "ja": "敬語(です・ます調)を使用してください"
    }
  },
  "provider": {
    "default": "claude",
    "model": "claude-sonnet-4-20250514",
    "fallback": ["openai"]
  },
  "quality": {
    "threshold": 85,
    "maxIterations": 4,
    "evaluationMethod": "llm"
  },
  "chunking": {
    "maxTokens": 1024,
    "overlapTokens": 150,
    "preserveStructure": true
  },
  "glossary": {
    "path": "./glossary.json",
    "strict": false
  },
  "paths": {
    "output": "./docs/{lang}",
    "cache": "./.translate-cache"
  },
  "ignore": [
    "**/node_modules/**",
    "**/*.test.md",
    "**/drafts/**"
  ]
}
```

### 8.2 Glossary File (glossary.json)

```json
{
  "metadata": {
    "name": "Kubernetes Glossary",
    "sourceLang": "en",
    "targetLangs": ["ko", "ja", "zh-CN"],
    "version": "1.0.0",
    "domain": "cloud-native"
  },
  "terms": [
    {
      "source": "pod",
      "targets": {
        "ko": "파드",
        "ja": "ポッド",
        "zh-CN": "Pod"
      },
      "context": "Kubernetes resource unit",
      "caseSensitive": false
    },
    {
      "source": "deployment",
      "targets": {
        "ko": "디플로이먼트",
        "ja": "デプロイメント",
        "zh-CN": "部署"
      },
      "notes": "Some teams prefer keeping English in certain contexts"
    },
    {
      "source": "kubectl",
      "targets": {},
      "doNotTranslate": true
    },
    {
      "source": "service mesh",
      "targets": {
        "ko": "서비스 메시",
        "ja": "サービスメッシュ",
        "zh-CN": "服务网格"
      },
      "context": "Networking concept"
    },
    {
      "source": "container",
      "targets": {
        "ko": "컨테이너",
        "ja": "コンテナ",
        "zh-CN": "容器"
      }
    },
    {
      "source": "cluster",
      "targets": {
        "ko": "클러스터",
        "ja": "クラスター",
        "zh-CN": "集群"
      }
    },
    {
      "source": "node",
      "targets": {
        "ko": "노드",
        "ja": "ノード",
        "zh-CN": "节点"
      },
      "context": "Kubernetes worker machine"
    },
    {
      "source": "namespace",
      "targets": {
        "ko": "네임스페이스",
        "ja": "ネームスペース",
        "zh-CN": "命名空间"
      }
    },
    {
      "source": "ingress",
      "targets": {
        "ko": "인그레스",
        "ja": "Ingress",
        "zh-CN": "入口"
      },
      "doNotTranslateFor": ["ja"]
    },
    {
      "source": "ConfigMap",
      "targets": {},
      "doNotTranslate": true,
      "caseSensitive": true
    }
  ]
}
```

**Glossary Resolution Logic:**

```typescript
// src/services/glossary.ts

// Flattens a multi-language glossary into a single-target view for `targetLang`.
export function resolveGlossary(
  glossary: Glossary,
  targetLang: string
): ResolvedGlossary {
  return {
    metadata: {
      ...glossary.metadata,
      targetLang,
    },
    // Note: resolveTarget falls back to the source term, so `target` is never
    // undefined and the filter below currently keeps every term.
    terms: glossary.terms.map(term => ({
      source: term.source,
      target: resolveTarget(term, targetLang),
      context: term.context,
      caseSensitive: term.caseSensitive ?? false,
      doNotTranslate: resolveDoNotTranslate(term, targetLang),
    })).filter(term => term.target !== undefined),
  };
}

// Picks the target-language form of a term; do-not-translate terms and terms
// without an entry for `targetLang` keep their source form.
function resolveTarget(term: GlossaryTerm, targetLang: string): string {
  if (term.doNotTranslate) return term.source;
  if (term.doNotTranslateFor?.includes(targetLang)) return term.source;
  return term.targets[targetLang] ?? term.source;
}

// True when the term must stay untranslated, either globally or for `targetLang`.
function resolveDoNotTranslate(term: GlossaryTerm, targetLang: string): boolean {
  return term.doNotTranslate || term.doNotTranslateFor?.includes(targetLang) || false;
}
```

### 8.3 Cache Structure (.translate-cache/)

```
.translate-cache/
├── index.json                 # Cache index with file hashes
├── translations/
│   ├── {hash1}.json           # Cached translation result
│   ├── {hash2}.json
│   └── ...
└── glossary-compiled.json     # Pre-compiled glossary trie
```

**Cache Entry Format:**
```json
{
  "sourceHash": "sha256:abc123...",
  "sourceLang": "en",
  "targetLang": "ko",
  "glossaryHash": "sha256:def456...",
  "translation": "번역된 내용...",
  "qualityScore": 92,
  "createdAt": "2024-12-11T10:30:00Z",
  "provider": "claude",
  "model": "claude-sonnet-4-20250514"
}
```
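
The entry stores both a source hash and a glossary hash, which suggests that a cached translation should only be reused while neither the text nor the glossary has changed. The sketch below illustrates that reading; the `CacheLookup` type and `isCacheHit` function are hypothetical, and the RFC does not spell out the actual lookup rules.

```typescript
// Mirrors the cache entry format shown above.
interface CacheEntry {
  sourceHash: string;
  sourceLang: string;
  targetLang: string;
  glossaryHash: string;
  translation: string;
  qualityScore: number;
  createdAt: string;
  provider: string;
  model: string;
}

// The fields a new translation request would be matched on (assumption).
interface CacheLookup {
  sourceHash: string;
  sourceLang: string;
  targetLang: string;
  glossaryHash: string;
}

// Hypothetical check: an entry is reusable only if the source text, the
// language pair, and the glossary are all unchanged.
function isCacheHit(entry: CacheEntry, lookup: CacheLookup): boolean {
  return (
    entry.sourceHash === lookup.sourceHash &&
    entry.sourceLang === lookup.sourceLang &&
    entry.targetLang === lookup.targetLang &&
    entry.glossaryHash === lookup.glossaryHash
  );
}
```

Under this reading, editing a glossary term changes `glossaryHash` and forces a fresh translation even when the source file is untouched.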

---

## 9. Implementation Roadmap

### Phase 1: MVP

**Scope:** Single file translation with Claude, basic glossary support

| Task | Description | Estimate |
|------|-------------|----------|
| Project setup | TypeScript, tsup, vitest, Commander.js | 2h |
| Config loader | Parse .translaterc.json, merge CLI args | 3h |
| Claude provider | Implement LLMProvider interface | 4h |
| Basic chunker | Fixed-size token chunking | 3h |
| Markdown parser | Extract/restore text nodes via remark | 6h |
| Translation agent | Initial translate + single refine loop | 8h |
| Glossary loader | Parse JSON, build lookup map | 3h |
| CLI: file command | Single file translate with stdin/stdout | 4h |
| Basic tests | Unit tests for core components | 4h |
| **Total** | | **37h** |

**MVP Deliverable:**
```bash
cat doc.md | llm-translate -s en -t ko --glossary ./glossary.json
```

### Phase 2: Quality & Providers

| Task | Description | Estimate |
|------|-------------|----------|
| OpenAI provider | Implement adapter | 3h |
| Ollama provider | Implement adapter | 3h |
| Provider registry | Dynamic loading, fallback logic | 4h |
| Quality evaluator | LLM-based scoring | 4h |
| Iterative refinement | Full Self-Refine loop | 6h |
| HTML parser | Cheerio-based processing | 4h |
| Semantic chunker | Structure-aware splitting | 6h |
| Cache manager | File-based caching with hashing | 4h |
| **Total** | | **34h** |

### Phase 3: Batch & Polish

| Task | Description | Estimate |
|------|-------------|----------|
| CLI: dir command | Batch directory processing | 6h |
| Parallel processing | Concurrent file translation | 4h |
| Progress reporting | CLI progress bar, summary stats | 3h |
| CLI: init command | Interactive project setup | 3h |
| CLI: glossary commands | add, remove, list, validate | 4h |
| Error handling | Comprehensive error messages | 4h |
| Documentation | README, usage guide | 4h |
| Integration tests | End-to-end test suite | 6h |
| **Total** | | **34h** |

### Phase 4: Advanced Features (Future)

- MCP server implementation
- Translation Memory (TMX) support
- Embedding-based quality evaluation (COMET)
- RAG integration for context retrieval
- Watch mode for continuous translation
- VS Code extension

---

## 10. Testing Strategy

### 10.1 Unit Tests

```typescript
// tests/chunker.test.ts
describe('SemanticChunker', () => {
  it('should respect maxTokens limit', () => {});
  it('should preserve code blocks intact', () => {});
  it('should maintain overlap between chunks', () => {});
  it('should handle empty input', () => {});
});

// tests/glossary.test.ts
describe('GlossaryManager', () => {
  it('should find exact matches', () => {});
  it('should handle case sensitivity', () => {});
  it('should return doNotTranslate terms', () => {});
});

// tests/markdown.test.ts
describe('MarkdownParser', () => {
  it('should extract text nodes', () => {});
  it('should preserve code blocks', () => {});
  it('should maintain header hierarchy', () => {});
  it('should handle nested structures', () => {});
});
```

### 10.2 Integration Tests

```typescript
// tests/integration/translation.test.ts
describe('Translation Pipeline', () => {
  it('should translate markdown with glossary enforcement', async () => {
    const result = await translate({
      content: '# Deploy a Pod\nCreate a deployment...',
      sourceLang: 'en',
      targetLang: 'ko',
      glossary: k8sGlossary,
    });

    expect(result.content).toContain('파드');
    expect(result.content).toContain('디플로이먼트');
    expect(result.metadata.qualityScore).toBeGreaterThan(80);
  });
});
```

### 10.3 Test Fixtures

```
tests/fixtures/
├── input/
│   ├── simple.md
│   ├── with-code-blocks.md
│   ├── complex-structure.md
│   └── html-document.html
├── expected/
│   ├── simple.ko.md
│   ├── with-code-blocks.ko.md
│   └── ...
└── glossaries/
    ├── k8s-en-ko.json
    └── general-en-ja.json
```

---

## 11. Error Handling

### 11.1 Error Types

```typescript
// src/errors.ts

export class TranslationError extends Error {
  constructor(
    message: string,
    public code: ErrorCode,
    public details?: Record<string, unknown>
  ) {
    super(message);
    this.name = 'TranslationError';
  }
}

export enum ErrorCode {
  CONFIG_NOT_FOUND = 'CONFIG_NOT_FOUND',
  CONFIG_INVALID = 'CONFIG_INVALID',
  GLOSSARY_NOT_FOUND = 'GLOSSARY_NOT_FOUND',
  GLOSSARY_INVALID = 'GLOSSARY_INVALID',
  PROVIDER_NOT_FOUND = 'PROVIDER_NOT_FOUND',
  PROVIDER_AUTH_FAILED = 'PROVIDER_AUTH_FAILED',
  PROVIDER_RATE_LIMITED = 'PROVIDER_RATE_LIMITED',
  PROVIDER_ERROR = 'PROVIDER_ERROR',
  QUALITY_THRESHOLD_NOT_MET = 'QUALITY_THRESHOLD_NOT_MET',
  FILE_NOT_FOUND = 'FILE_NOT_FOUND',
  FILE_READ_ERROR = 'FILE_READ_ERROR',
  FILE_WRITE_ERROR = 'FILE_WRITE_ERROR',
  UNSUPPORTED_FORMAT = 'UNSUPPORTED_FORMAT',
  CHUNK_TOO_LARGE = 'CHUNK_TOO_LARGE',
}
```

### 11.2 User-Friendly Messages

```typescript
const errorMessages: Record<ErrorCode, string> = {
  CONFIG_NOT_FOUND: 'Configuration file not found. Run `llm-translate init` to create one.',
  PROVIDER_AUTH_FAILED: 'Authentication failed. Check your API key in environment variables.',
  QUALITY_THRESHOLD_NOT_MET: 'Translation quality ({score}) did not meet threshold ({threshold}). Use --quality to adjust or --max-iterations to allow more refinement.',
  // ...
};
```

---

## 12. Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| `ANTHROPIC_API_KEY` | Claude API key | If using Claude |
| `OPENAI_API_KEY` | OpenAI API key | If using OpenAI |
| `OLLAMA_BASE_URL` | Ollama server URL | If using Ollama (default: http://localhost:11434) |
| `LLM_TRANSLATE_CONFIG` | Default config file path | No |
| `LLM_TRANSLATE_CACHE_DIR` | Cache directory path | No |
| `LLM_TRANSLATE_LOG_LEVEL` | Log level (debug/info/warn/error) | No |
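
Reading these variables can be kept in one place so the rest of the code never touches the environment directly. The sketch below is an illustration under assumptions: the `EnvSettings` shape and `readEnvSettings` name are hypothetical, and the `"info"` fallback for the log level is a guess, since the table does not state a default. Only the Ollama default URL comes from the table.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface EnvSettings {
  anthropicApiKey: string | undefined;
  openaiApiKey: string | undefined;
  ollamaBaseUrl: string;
  configPath: string | undefined;
  cacheDir: string | undefined;
  logLevel: LogLevel;
}

const LOG_LEVELS: readonly LogLevel[] = ["debug", "info", "warn", "error"];

// Hypothetical helper: takes the environment as a plain record so it can be
// called with `process.env` in the CLI and with a literal object in tests.
function readEnvSettings(env: Record<string, string | undefined>): EnvSettings {
  const requested = env["LLM_TRANSLATE_LOG_LEVEL"];
  // Unknown or missing levels fall back to "info" (assumed default).
  const logLevel = LOG_LEVELS.find((level) => level === requested) ?? "info";
  return {
    anthropicApiKey: env["ANTHROPIC_API_KEY"],
    openaiApiKey: env["OPENAI_API_KEY"],
    ollamaBaseUrl: env["OLLAMA_BASE_URL"] ?? "http://localhost:11434",
    configPath: env["LLM_TRANSLATE_CONFIG"],
    cacheDir: env["LLM_TRANSLATE_CACHE_DIR"],
    logLevel,
  };
}
```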

---

## 13. Success Criteria

### 13.1 Functional Requirements

| ID | Requirement | Acceptance Criteria |
|----|-------------|---------------------|
| FR1 | Glossary enforcement | 100% of glossary terms applied correctly |
| FR2 | Format preservation | Markdown/HTML structure identical after translation |
| FR3 | Quality threshold | Default 85% quality score achieved |
| FR4 | Multi-provider | Claude, OpenAI, Ollama all functional |
| FR5 | Batch processing | 100+ files processed without failure |
| FR6 | stdin/stdout | Piped input/output works correctly |

### 13.2 Non-Functional Requirements

| ID | Requirement | Target |
|----|-------------|--------|
| NFR1 | Startup time | < 500ms |
| NFR2 | Single file (1000 words) | < 30s |
| NFR3 | Memory usage | < 256MB for typical workloads |
| NFR4 | Test coverage | > 80% |
| NFR5 | Documentation | README + CLI help complete |

---

## 14. Open Questions

1. **Quality Metric Selection**: Should we invest in COMET integration for Phase 2, or is LLM-based evaluation sufficient for most use cases?

2. **Glossary Format**: Support additional formats like TBX/CSV, or keep JSON-only for simplicity?

3. **Translation Memory**: Priority for TMX support? Would users benefit from leveraging past translations?

4. **Naming**: Final project name? Candidates: `llm-translate`, `doctrans`, `transforge`, `lexicon`

5. **Distribution**: npm package only, or also provide standalone binaries via pkg?

---

## 15. References

- [Andrew Ng's translation-agent](https://github.com/andrewyng/translation-agent) - Self-refine pattern reference
- [OmegaT Documentation](https://omegat.org/) - Glossary/TM patterns
- [Vercel AI SDK](https://sdk.vercel.ai/) - Multi-provider abstraction
- [unified/remark](https://unifiedjs.com/) - Markdown processing
- [WMT22 Metrics](https://aclanthology.org/2022.wmt-1.2/) - Quality evaluation research
- [MCP Specification](https://modelcontextprotocol.io/) - Agent integration

---

## Appendix A: Sample Glossary for Testing

```json
{
  "metadata": {
    "name": "ML/AI Glossary",
    "sourceLang": "en",
    "targetLangs": ["ko", "ja", "zh-CN"],
    "version": "1.0.0",
    "domain": "machine-learning"
  },
  "terms": [
    {
      "source": "machine learning",
      "targets": {
        "ko": "머신러닝",
        "ja": "機械学習",
        "zh-CN": "机器学习"
      }
    },
    {
      "source": "artificial intelligence",
      "targets": {
        "ko": "인공지능",
        "ja": "人工知能",
        "zh-CN": "人工智能"
      }
    },
    {
      "source": "neural network",
      "targets": {
        "ko": "신경망",
        "ja": "ニューラルネットワーク",
        "zh-CN": "神经网络"
      }
    },
    {
      "source": "deep learning",
      "targets": {
        "ko": "딥러닝",
        "ja": "ディープラーニング",
        "zh-CN": "深度学习"
      }
    },
    {
      "source": "API",
      "targets": {},
      "doNotTranslate": true
    },
    {
      "source": "SDK",
      "targets": {},
      "doNotTranslate": true
    },
    {
      "source": "open source",
      "targets": {
        "ko": "오픈 소스",
        "ja": "オープンソース",
        "zh-CN": "开源"
      }
    },
    {
      "source": "repository",
      "targets": {
        "ko": "저장소",
        "ja": "リポジトリ",
        "zh-CN": "仓库"
      }
    },
    {
      "source": "pull request",
      "targets": {
        "ko": "풀 리퀘스트",
        "ja": "プルリクエスト",
        "zh-CN": "拉取请求"
      },
      "doNotTranslateFor": ["ko", "ja"]
    },
    {
      "source": "merge",
      "targets": {
        "ko": "병합",
        "ja": "マージ",
        "zh-CN": "合并"
      }
    }
  ]
}
```

---

## Appendix B: Example Translation Session

**Input (docs/guide.md):**
````markdown
# Getting Started with Machine Learning

This guide introduces the basics of artificial intelligence and deep learning.

## Prerequisites

- Python 3.8+
- An API key from OpenAI

## Installation

```bash
pip install tensorflow
```

Create a neural network model:

```python
model = Sequential([
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
```
````

**Command:**
```bash
llm-translate file docs/guide.md -s en -t ko \
  --glossary ./ml-glossary.json \
  -o docs/ko/guide.md \
  --verbose
```

**Output (docs/ko/guide.md):**
````markdown
# 머신러닝 시작하기

이 가이드는 인공지능과 딥러닝의 기초를 소개합니다.

## 사전 요구사항

- Python 3.8+
- OpenAI의 API 키

## 설치

```bash
pip install tensorflow
```

신경망 모델을 생성합니다:

```python
model = Sequential([
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
```
````

**Console Output:**
```
✓ Loaded glossary: 10 terms
✓ Parsed markdown: 4 translatable sections
✓ Translation complete
  - Quality: 94/100
  - Iterations: 2
  - Tokens: 847 input, 512 output
  - Duration: 8.3s
✓ Written to docs/ko/guide.md
```