llm_translate 0.3.0 → 0.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: b3d7bffb10cabd77729e1e806a256a16bd7a036faa0ea52b7024e3a521dada5e
4
- data.tar.gz: 50fcf8a22940afb311387913d2455dc519812abe8618be54c47541808193afd6
3
+ metadata.gz: 2948aad0cdc839f8e2d8a9caa933f1ba20d63be25618d5cca924e565cbc55c30
4
+ data.tar.gz: bbfd68881a4cfdcb90b16aae2a76288285f9573a0f50df9094c6eec8064da1a0
5
5
  SHA512:
6
- metadata.gz: 76e635e0838f893377ed9ba05e8e4e04c36053cbb4c68b2ab09cbd8c69d8a7433400886e1f817647d9b322fc7dad1e4643cc53b6471fc7ed2b6538df015108d7
7
- data.tar.gz: 713bb859602fd5f511444f8e491bcd556beb7954f2f03cb36ff79938ff1e07e60c5419dd2c9818dc4d1f95136af8d4722b649a267e4217128598a26036b50532
6
+ metadata.gz: ed2d0f24657d7e86f3c8b80fcfebfde202c57461aaef9c25979d7aeffe23c82fe4f9bc78fa8aef01a488bace251e17033c0ec1bd87a6029facf6d3973f035817
7
+ data.tar.gz: 00c43857ac6c811641d00b1ec44684d7bde9447364493dfebecf6f69f24a148549d8ed1f3fc7b65d5500a95abaa7330a1869f5746ebe28fa6e993ca258f6e91d
data/DOCUMENT_SPLITTER_SUMMARY.md ADDED
@@ -0,0 +1,123 @@
1
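+ # Document Splitter Feature Update Summary
2
+
3
+ ## 📋 Update Overview
4
+
5
+ This update adds intelligent document splitting to LlmTranslate, intended for translating large Markdown documents.
6
+
7
+ ## 🆕 New Features
8
+
9
+ ### 1. Document Splitter (DocumentSplitter)
10
+ - **Location**: `lib/llm_translate/document_splitter.rb`
11
+ - **Function**: Splits large documents along their Markdown structure
12
+ - **Highlights**:
13
+   - Recognizes the boundaries of Markdown elements such as headings, code blocks, and lists
14
+   - Avoids splitting in the middle of important structures
15
+   - Automatically merges the translated chunks
16
+
17
+ ### 2. Configuration Extensions
18
+ - **Location**: `lib/llm_translate/config.rb`
19
+ - **New options**:
20
+   - `enable_splitting`: Enables or disables document splitting
21
+   - `max_chars`: Character-count threshold that triggers splitting
22
+   - `every_chars`: Target character count for each chunk
23
+
24
+ ### 3. Translator Engine Integration
25
+ - **Location**: `lib/llm_translate/translator_engine.rb`
26
+ - **Function**: Automatically detects large documents and translates them in chunks
27
+ - **Highlights**:
28
+   - Translates chunk by chunk
29
+   - Progress tracking and logging
30
+   - Error handling and retry mechanism
31
+
32
+ ## 📝 Documentation Updates
33
+
34
+ ### 1. README.md
35
+ - ✅ Added document splitting to the feature list
36
+ - ✅ Added a dedicated "Document Splitting" section
37
+ - ✅ Updated the configuration examples to include the splitting options
38
+ - ✅ Added usage examples and sample log output
39
+ - ✅ Updated the changelog (v0.2.0)
40
+
41
+ ### 2. Configuration Files
42
+ - ✅ **llm_translate.yml**: Added the document splitting options with explanations
43
+ - ✅ **large_document_config.yml**: New configuration dedicated to large-document translation
44
+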
45
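+ ## 🎯 Core Configuration
46
+
47
+ ### Basic Configuration
48
+ ```yaml
49
+ translation:
50
+   enable_splitting: true # Enable document splitting
51
+   max_chars: 20000 # Split when the document exceeds 20k characters
52
+   every_chars: 18000 # Target size of each chunk
53
+ ```
54
+
55
+ ### Performance Tuning
56
+ ```yaml
57
+ performance:
58
+   concurrent_files: 1 # Single-threaded processing is recommended when splitting
59
+   request_interval: 2 # Wait 2 seconds between chunks
60
+ ```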
61
+
62
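+ ## 🔧 Use Cases
63
+
64
+ ### 1. Large Technical Documents
65
+ - GitLab documentation (65k+ characters)
66
+ - API documentation
67
+ - User manuals
68
+
69
+ ### 2. Long-Form Articles
70
+ - Blog posts
71
+ - Tutorials
72
+ - Specifications
73
+
74
+ ## 📊 Performance
75
+
76
+ ### Example: GitLab OIDC Documentation
77
+ - **Original size**: 65,277 characters
78
+ - **Split result**: 4 chunks
79
+ - **Chunk sizes**: 18,500 / 19,200 / 17,800 / 9,777 characters
80
+ - **Processing**: Each chunk is translated separately, then merged automatically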
81
+
82
+ ## 🚀 Usage
83
+
84
+ ### 1. Basic Usage
85
+ ```bash
86
+ llm_translate translate --config ./llm_translate.yml --input ./large_doc.md --output ./large_doc.zh.md
87
+ ```
88
+
89
+ ### 2. Dedicated Large-Document Configuration
90
+ ```bash
91
+ llm_translate translate --config ./large_document_config.yml
92
+ ```
93
+
94
+ ### 3. Sample Log Output
95
+ ```
96
+ [INFO] Document size (65277 chars) exceeds limit, splitting...
97
+ [INFO] Document split into 4 chunks
98
+ [INFO] Translating chunk 1/4 (18500 chars)...
99
+ [INFO] Translating chunk 2/4 (19200 chars)...
100
+ [INFO] Translating chunk 3/4 (17800 chars)...
101
+ [INFO] Translating chunk 4/4 (9777 chars)...
102
+ [INFO] Merging translated chunks...
103
+ [INFO] Translation completed successfully!
104
+ ```
105
+
106
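+ ## ✨ Key Advantages
107
+
108
+ 1. **Seamless integration**: Automatic detection, no manual intervention required
109
+ 2. **Format preservation**: Fully preserves the Markdown structure
110
+ 3. **Smart splitting**: Splits at semantic boundaries without breaking content
111
+ 4. **Fault tolerance**: A failure in a single chunk does not affect overall processing
112
+ 5. **Visible progress**: Detailed progress and status information
113
+
114
+ ## 📋 Testing and Verification
115
+
116
+ - ✅ Syntax check passed
117
+ - ✅ Functional testing completed
118
+ - ✅ Large-document processing verified
119
+ - ✅ Configuration files verified
120
+
121
+ ## 🎉 Release
122
+
123
+ This feature ships as the headline feature of **v0.2.0**, giving users a powerful way to handle large documents.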
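The sample run in the summary (65,277 characters split into 4 chunks) agrees with simple arithmetic on the two thresholds. The following is a minimal sketch of that arithmetic, not code from the gem: `min_chunk_count` is a hypothetical helper, and its result is only a rough lower bound, assuming no single line is longer than `every_chars`. The real splitter cuts at Markdown boundaries, so it may produce more chunks.

```ruby
# Hypothetical helper (not part of llm_translate): estimates the fewest chunks
# a document can be split into, given the two configured thresholds.
def min_chunk_count(doc_chars, max_chars: 20_000, every_chars: 18_000)
  # Documents at or below the threshold are translated in one piece.
  return 1 if doc_chars <= max_chars

  # Each chunk holds at most every_chars characters, so round up.
  (doc_chars.to_f / every_chars).ceil
end

puts min_chunk_count(65_277) # the GitLab OIDC example from the summary
puts min_chunk_count(15_000) # below max_chars, so no splitting
```

The same count results with `every_chars: 20_000`, the default used in the README examples.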
data/README.md CHANGED
@@ -6,6 +6,7 @@ AI-powered Markdown translator that preserves formatting while translating conte
6
6
 
7
7
  - 🤖 **AI-Powered Translation**: Support for OpenAI, Anthropic, and Ollama
8
8
  - 📝 **Markdown Format Preservation**: Keeps code blocks, links, images, and formatting intact
9
+ - 📄 **Document Splitting**: Intelligent splitting of large documents for optimal translation
9
10
  - 🔧 **Flexible Configuration**: YAML-based configuration with environment variable support
10
11
  - 📁 **Batch Processing**: Recursively processes entire directory structures
11
12
  - 🚀 **CLI Interface**: Easy-to-use command-line interface with Thor
@@ -72,6 +73,13 @@ ai:
72
73
 
73
74
  translation:
74
75
  target_language: "zh-CN"
76
+ preserve_formatting: true
77
+
78
+ # Document Splitting Configuration
79
+ enable_splitting: true
80
+ max_chars: 20000
81
+ every_chars: 20000
82
+
75
83
  default_prompt: |
76
84
  Please translate the following Markdown content to Chinese, keeping all formatting intact:
77
85
  - Preserve code blocks, links, images, and other Markdown syntax
@@ -117,6 +125,54 @@ ai:
117
125
  # Set OLLAMA_HOST environment variable if not using default
118
126
  ```
119
127
 
128
+ ## Document Splitting
129
+
130
+ For large documents that exceed token limits or need more manageable processing, the translator includes an intelligent document splitting feature.
131
+
132
+ ### How It Works
133
+
134
+ 1. **Automatic Detection**: When a document exceeds the configured `max_chars` threshold, splitting is automatically triggered
135
+ 2. **Smart Splitting**: Documents are split at natural Markdown boundaries (headers, code blocks, lists, etc.)
136
+ 3. **Individual Translation**: Each chunk is translated separately with proper context
137
+ 4. **Seamless Merging**: Translated chunks are automatically merged back into a complete document
138
+
139
+ ### Configuration
140
+
141
+ ```yaml
142
+ translation:
143
+ # Enable document splitting
144
+ enable_splitting: true
145
+
146
+ # Trigger splitting when document exceeds this character count
147
+ max_chars: 20000
148
+
149
+ # Target size for each chunk
150
+ every_chars: 20000
151
+ ```
152
+
153
+ ### Benefits
154
+
155
+ - **Large Document Support**: Handle documents of any size without token limit issues
156
+ - **Better Translation Quality**: Smaller chunks allow for more focused translation
157
+ - **Format Preservation**: Maintains Markdown structure across splits
158
+ - **Automatic Processing**: No manual intervention required
159
+
160
+ ### Example
161
+
162
+ ```bash
163
+ # Translate a large GitLab documentation file (65,000+ characters)
164
+ llm_translate translate --config ./config.yml --input ./large_doc.md --output ./large_doc.zh.md
165
+
166
+ # Output:
167
+ # [INFO] Document size (65277 chars) exceeds limit, splitting...
168
+ # [INFO] Document split into 4 chunks
169
+ # [INFO] Translating chunk 1/4 (18500 chars)...
170
+ # [INFO] Translating chunk 2/4 (19200 chars)...
171
+ # [INFO] Translating chunk 3/4 (17800 chars)...
172
+ # [INFO] Translating chunk 4/4 (9777 chars)...
173
+ # [INFO] Merging translated chunks...
174
+ ```
175
+
120
176
  ## Usage
121
177
 
122
178
  ### Basic Translation
@@ -177,6 +233,11 @@ translation:
177
233
  default_prompt: "Your custom prompt with {content} placeholder"
178
234
  preserve_formatting: true
179
235
  translate_code_comments: false
236
+
237
+ # Document Splitting Settings
238
+ enable_splitting: true # Enable document splitting for large files
239
+ max_chars: 20000 # Trigger splitting when document exceeds this size
240
+ every_chars: 20000 # Target size for each chunk
180
241
 
181
242
  # File Processing
182
243
  files:
@@ -287,6 +348,14 @@ The gem is available as open source under the terms of the [MIT License](https:/
287
348
 
288
349
  ## Changelog
289
350
 
351
+ ### v0.2.0
352
+ - **NEW**: Document Splitting feature for large files
353
+ - **NEW**: Intelligent Markdown-aware splitting at natural boundaries
354
+ - **NEW**: Automatic chunk translation and merging
355
+ - **IMPROVED**: Better handling of large documents (65k+ characters)
356
+ - **IMPROVED**: Enhanced configuration options for document processing
357
+ - **IMPROVED**: Optimized performance settings for split document workflows
358
+
290
359
  ### v0.1.0
291
360
  - Initial release
292
361
  - Support for OpenAI, Anthropic, and Ollama providers
data/large_document_config.yml ADDED
@@ -0,0 +1,146 @@
1
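+ # Configuration dedicated to large-document translation
2
+ # Intended for large documents of more than 20,000 characters
3
+
4
+ # AI model configuration
5
+ ai:
6
+   # API key
7
+   api_key: ${LLM_TRANSLATE_API_KEY}
8
+
9
+   # API host
10
+   host: https://aihubmix.com
11
+
12
+   # Model provider
13
+   provider: "claude"
14
+
15
+   # Model name - use a model that supports a large context
16
+   model: "claude-3-7-sonnet-20250219"
17
+
18
+   # Model parameters
19
+   temperature: 0.3
20
+   max_tokens: 40000 # Raised token limit
21
+
22
+   # Request retry settings
23
+   retry_attempts: 3
24
+   retry_delay: 3 # Longer retry delay
25
+
26
+   # Request timeout
27
+   timeout: 120 # Longer timeout
28
+
29
+ # Translation configuration
30
+ translation:
31
+   # Default translation prompt
32
+   default_prompt: |
33
+     Please translate the following Markdown content to Chinese, keeping all formatting intact:
34
+     - Preserve code blocks, links, images, and other Markdown syntax
35
+     - Keep English technical terms and product names
36
+     - Ensure natural and fluent translation
37
+     - Note: this is one chunk of a large document, so keep the translation consistent
38
+
39
+     Content:
40
+     {content}
41
+
42
+   # Target language
43
+   target_language: "zh-CN"
44
+
45
+   # Source language (auto = detect automatically)
46
+   source_language: "auto"
47
+
48
+   # Whether to preserve the original formatting
49
+   preserve_formatting: true
50
+
51
+   # Whether to translate code comments
52
+   translate_code_comments: false
53
+
54
+   # Document splitting - tuned for large documents
55
+   enable_splitting: true
56
+
57
+   # Maximum character count before splitting is triggered
58
+   max_chars: 20000
59
+
60
+   # Target character count per chunk (slightly below max_chars to leave a buffer)
61
+   every_chars: 18000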
62
+
63
+ # File handling configuration
64
+ files:
65
+   # Single-file mode example
66
+   input_file: "./large_document.md"
67
+   output_file: "./large_document.zh.md"
68
+
69
+   # File overwrite policy
70
+   overwrite_policy: "overwrite" # Overwrite directly; suits large-document processing
71
+
72
+ # Logging configuration
73
+ logging:
74
+   # Log level - use info to see splitting progress
75
+   level: "info"
76
+
77
+   # Log output destination
78
+   output: "both" # Write to both the console and a file
79
+
80
+   # Log file path
81
+   file_path: "./logs/large_doc_translation.log"
82
+
83
+   # Log the translation process in detail
84
+   verbose_translation: true
85
+
86
+   # Error log file
87
+   error_log_path: "./logs/large_doc_errors.log"
88
+
89
+ # Error handling configuration
90
+ error_handling:
91
+   # Behavior when an error occurs
92
+   on_error: "log_and_continue"
93
+
94
+   # Maximum number of consecutive errors
95
+   max_consecutive_errors: 3 # Stricter for large-document processing
96
+
97
+   # Number of retries after a failure
98
+   retry_on_failure: 3 # More retries
99
+
100
+   # Generate an error report
101
+   generate_error_report: true
102
+   error_report_path: "./logs/large_doc_error_report.md"
103
+
104
+ # Performance configuration - tuned for large documents
105
+ performance:
106
+   # Number of files processed concurrently - single-threaded when splitting large documents
107
+   concurrent_files: 1
108
+
109
+   # Request interval - especially important for avoiding API rate limits
110
+   request_interval: 2 # Longer interval
111
+
112
+   # Memory usage limit
113
+   max_memory_mb: 1000 # Higher memory limit
114
+
115
+ # Output configuration
116
+ output:
117
+   # Show a progress bar
118
+   show_progress: true
119
+
120
+   # Show translation statistics
121
+   show_statistics: true
122
+
123
+   # Generate a translation report
124
+   generate_report: true
125
+   report_path: "./reports/large_doc_translation_report.md"
126
+
127
+   # Output format
128
+   format: "markdown"
129
+
130
+   # Keep metadata
131
+   include_metadata: true
132
+
133
+ # Usage:
134
+ # 1. Set the environment variable: export LLM_TRANSLATE_API_KEY="your-api-key"
135
+ # 2. Adjust the input_file and output_file paths
136
+ # 3. Run: llm_translate translate --config ./large_document_config.yml
137
+ #
138
+ # Sample output:
139
+ # [INFO] Document size (65277 chars) exceeds limit, splitting...
140
+ # [INFO] Document split into 4 chunks
141
+ # [INFO] Translating chunk 1/4 (18500 chars)...
142
+ # [INFO] Translating chunk 2/4 (19200 chars)...
143
+ # [INFO] Translating chunk 3/4 (17800 chars)...
144
+ # [INFO] Translating chunk 4/4 (9777 chars)...
145
+ # [INFO] Merging translated chunks...
146
+ # [INFO] Translation completed successfully!
data/lib/llm_translate/config.rb CHANGED
@@ -39,7 +39,7 @@ module LlmTranslate
39
39
  end
40
40
 
41
41
  def max_tokens
42
- data.dig('ai', 'max_tokens') || 4000
42
+ data.dig('ai', 'max_tokens') || 40_000
43
43
  end
44
44
 
45
45
  def retry_attempts
@@ -75,6 +75,19 @@ module LlmTranslate
75
75
  data.dig('translation', 'translate_code_comments') == true
76
76
  end
77
77
 
78
+ # Document Splitting Configuration
79
+ def max_chars_for_splitting
80
+ data.dig('translation', 'max_chars') || 20_000
81
+ end
82
+
83
+ def split_every_chars
84
+ data.dig('translation', 'every_chars') || 20_000
85
+ end
86
+
87
+ def enable_document_splitting?
88
+ data.dig('translation', 'enable_splitting') != false
89
+ end
90
+
78
91
  # File Configuration
79
92
  def input_directory
80
93
  cli_options[:input] || data.dig('files', 'input_directory') || './docs'
@@ -85,15 +98,24 @@ module LlmTranslate
85
98
  end
86
99
 
87
100
  def input_file
101
+ return cli_options[:input] if cli_options[:input]
102
+
88
103
  data.dig('files', 'input_file')
89
104
  end
90
105
 
91
106
  def output_file
107
+ return cli_options[:output] if cli_options[:input] && cli_options[:output]
108
+
92
109
  data.dig('files', 'output_file')
93
110
  end
94
111
 
95
112
  def single_file_mode?
96
- !input_file.nil? && !output_file.nil?
113
+ input_file_path = input_file
114
+ output_file_path = output_file
115
+
116
+ # Both must be present and input must be a file (not directory) for single file mode
117
+ !input_file_path.nil? && !output_file_path.nil? &&
118
+ File.exist?(input_file_path) && File.file?(input_file_path)
97
119
  end
98
120
 
99
121
  def filename_strategy
@@ -196,9 +218,46 @@ module LlmTranslate
196
218
  'API key is required. Set LLM_TRANSLATE_API_KEY environment variable or configure in config file.'
197
219
  end
198
220
 
199
- return if Dir.exist?(File.dirname(input_directory))
221
+ # Validate input/output based on mode
222
+ if single_file_mode?
223
+ validate_single_file_mode
224
+ else
225
+ validate_directory_mode
226
+ end
227
+ end
228
+
229
+ def validate_single_file_mode
230
+ # Validate input file exists
231
+ unless input_file && File.exist?(input_file)
232
+ raise ConfigurationError, "Input file does not exist: #{input_file || 'not specified'}"
233
+ end
234
+
235
+ # Validate input is actually a file
236
+ raise ConfigurationError, "Input path is not a file: #{input_file}" unless File.file?(input_file)
237
+
238
+ # Validate output file path
239
+ raise ConfigurationError, 'Output file must be specified for single file mode' unless output_file
240
+
241
+ # Ensure output directory exists
242
+ output_dir = File.dirname(output_file)
243
+ return if Dir.exist?(output_dir)
244
+
245
+ begin
246
+ FileUtils.mkdir_p(output_dir)
247
+ rescue StandardError => e
248
+ raise ConfigurationError, "Cannot create output directory #{output_dir}: #{e.message}"
249
+ end
250
+ end
251
+
252
+ def validate_directory_mode
253
+ # Validate input directory
254
+ raise ConfigurationError, "Input directory does not exist: #{input_directory}" unless Dir.exist?(input_directory)
255
+
256
+ # Ensure output directory parent exists
257
+ output_parent = File.dirname(output_directory)
258
+ return if Dir.exist?(output_parent)
200
259
 
201
- raise ConfigurationError, "Input directory parent does not exist: #{File.dirname(input_directory)}"
260
+ raise ConfigurationError, "Output directory parent does not exist: #{output_parent}"
202
261
  end
203
262
 
204
263
  def resolve_env_var(value)
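The three splitting readers added in this hunk have defaults worth spelling out: a missing `max_chars` or `every_chars` falls back to 20,000, and because `enable_document_splitting?` compares against `false`, splitting stays on unless the YAML sets `enable_splitting: false` explicitly. The sketch below restates only those three readers over a plain Hash. `SplittingSettings` is a hypothetical stand-in for illustration, not the gem's `Config` class.

```ruby
# Hypothetical stand-in for the gem's Config class: only the three
# splitting readers from the hunk above are reproduced.
class SplittingSettings
  def initialize(data)
    @data = data
  end

  # A missing key falls back to 20_000.
  def max_chars_for_splitting
    @data.dig('translation', 'max_chars') || 20_000
  end

  def split_every_chars
    @data.dig('translation', 'every_chars') || 20_000
  end

  # Only a literal false disables splitting; a missing key leaves it enabled.
  def enable_document_splitting?
    @data.dig('translation', 'enable_splitting') != false
  end
end

defaults = SplittingSettings.new({})
disabled = SplittingSettings.new(
  { 'translation' => { 'enable_splitting' => false, 'every_chars' => 18_000 } }
)

puts defaults.enable_document_splitting? # splitting is on by default
puts disabled.enable_document_splitting?
```

One consequence of the `!= false` comparison is that any other value, including the string `"false"`, leaves splitting enabled.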
data/lib/llm_translate/document_splitter.rb ADDED
@@ -0,0 +1,157 @@
1
+ # frozen_string_literal: true
2
+
3
+ module LlmTranslate
4
+   class DocumentSplitter
5
+     attr_reader :config, :logger
6
+
7
+     def initialize(config, logger = nil)
8
+       @config = config
9
+       @logger = logger || Logger.new($stdout, level: :info)
10
+     end
11
+
12
+     # Split the document into multiple chunks
13
+     def split_document(content)
14
+       return [content] unless should_split?(content)
15
+
16
+       logger.info "Document size (#{content.length} chars) exceeds limit, splitting..."
17
+
18
+       sections = extract_markdown_sections(content)
19
+       chunks = build_chunks(sections)
20
+
21
+       logger.info "Document split into #{chunks.length} chunks"
22
+       chunks
23
+     end
24
+
25
+     # Merge the translated document chunks
26
+     def merge_translated_chunks(translated_chunks)
27
+       return translated_chunks.first if translated_chunks.length == 1
28
+
29
+       logger.info "Merging #{translated_chunks.length} translated chunks..."
30
+
31
+       # Simple merge: join the chunks with a blank line
32
+       merged_content = translated_chunks.join("\n\n")
33
+
34
+       # Clean up extra blank lines
35
+       clean_merged_content(merged_content)
36
+     end
37
+
38
+     private
39
+
40
+     def should_split?(content)
41
+       content.length > config.max_chars_for_splitting
42
+     end
43
+
44
+     def extract_markdown_sections(content)
45
+       sections = []
46
+       current_section = ''
47
+       lines = content.split("\n")
48
+
49
+       lines.each do |line|
50
+         # Check whether this line starts a new section (a heading, content after a blank line, etc.)
51
+         if is_section_boundary?(line, current_section) && !current_section.strip.empty?
52
+           sections << current_section.strip
53
+           current_section = ''
54
+         end
55
+
56
+         current_section += "#{line}\n"
57
+       end
58
+
59
+       # Add the last section
60
+       sections << current_section.strip unless current_section.strip.empty?
61
+
62
+       sections
63
+     end
64
+
65
+     def is_section_boundary?(line, current_section)
66
+       return false if current_section.strip.empty?
67
+
68
+       # Heading line
69
+       return true if line.start_with?('#') && line.match?(/^#+\s+/)
70
+
71
+       # Start or end of a code block
72
+       return true if line.match?(/^```/)
73
+
74
+       # List item
75
+       return true if line.match?(/^\s*[-*+]\s+/) || line.match?(/^\s*\d+\.\s+/)
76
+
77
+       # Block quote
78
+       return true if line.match?(/^>\s+/)
79
+
80
+       # Horizontal rule
81
+       return true if line.match?(/^[-*_]{3,}$/)
82
+
83
+       # Table row
84
+       return true if line.match?(/^\|.*\|$/)
85
+
86
+       # Non-empty line after a blank line (new paragraph)
87
+       return true if current_section.end_with?("\n\n") && !line.strip.empty?
88
+
89
+       false
90
+     end
91
+
92
+     def build_chunks(sections)
93
+       chunks = []
94
+       current_chunk = ''
95
+
96
+       sections.each do |section|
97
+         # A single section that exceeds the limit on its own must be force-split
98
+         if section.length > config.split_every_chars
99
+           # Save the current chunk
100
+           chunks << current_chunk.strip unless current_chunk.strip.empty?
101
+
102
+           # Force-split the long section
103
+           forced_chunks = force_split_section(section)
104
+           chunks.concat(forced_chunks)
105
+
106
+           current_chunk = ''
107
+           next
108
+         end
109
+
110
+         # Check whether adding this section would exceed the limit
111
+         potential_length = current_chunk.length + section.length + 2 # +2 for "\n\n"
112
+
113
+         if potential_length > config.split_every_chars && !current_chunk.strip.empty?
114
+           # Save the current chunk and start a new one
115
+           chunks << current_chunk.strip
116
+           current_chunk = "#{section}\n\n"
117
+         else
118
+           # Append to the current chunk
119
+           current_chunk += "#{section}\n\n"
120
+         end
121
+       end
122
+
123
+       # Add the last chunk
124
+       chunks << current_chunk.strip unless current_chunk.strip.empty?
125
+
126
+       chunks
127
+     end
128
+
129
+     def force_split_section(section)
130
+       chunks = []
131
+       lines = section.split("\n")
132
+       current_chunk = ''
133
+
134
+       lines.each do |line|
135
+         potential_length = current_chunk.length + line.length + 1 # +1 for "\n"
136
+
137
+         if potential_length > config.split_every_chars && !current_chunk.strip.empty?
138
+           chunks << current_chunk.strip
139
+           current_chunk = "#{line}\n"
140
+         else
141
+           current_chunk += "#{line}\n"
142
+         end
143
+       end
144
+
145
+       chunks << current_chunk.strip unless current_chunk.strip.empty?
146
+       chunks
147
+     end
148
+
149
+     def clean_merged_content(content)
150
+       # Remove extra blank lines (runs of more than 2 consecutive newlines)
151
+       cleaned = content.gsub(/\n{3,}/, "\n\n")
152
+
153
+       # Ensure the document ends with a single newline
154
+       "#{cleaned.strip}\n"
155
+     end
156
+   end
157
+ end
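The greedy packing in `build_chunks` is easier to follow on a toy input. The sketch below is a simplified, standalone restatement and not the gem's code: `pack_sections` is a hypothetical name, the limit is passed in directly instead of being read from `config.split_every_chars`, and the forced split of oversized sections is left out.

```ruby
# Simplified restatement of the packing loop: append sections to the current
# chunk until the next one would push it past the limit, then start a new chunk.
def pack_sections(sections, limit)
  chunks = []
  current = +''

  sections.each do |section|
    candidate_length = current.length + section.length + 2 # +2 for "\n\n"

    if candidate_length > limit && !current.strip.empty?
      chunks << current.strip
      current = +"#{section}\n\n"
    else
      current << "#{section}\n\n"
    end
  end

  chunks << current.strip unless current.strip.empty?
  chunks
end

sections = ['# Title', 'First paragraph.', '## Next', 'Second paragraph.']
chunks = pack_sections(sections, 30)
chunks.each_with_index { |chunk, i| puts "chunk #{i + 1}: #{chunk.inspect}" }
```

With a limit of 30, the first two sections fit together (9 + 18 = 27 characters including separators). Adding `## Next` would bring the running length to 36, so a second chunk starts there.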
data/lib/llm_translate/translator_engine.rb CHANGED
@@ -3,16 +3,18 @@
3
3
  require 'pathname'
4
4
  require 'fileutils'
5
5
  require 'async'
6
+ require_relative 'document_splitter'
6
7
 
7
8
  module LlmTranslate
8
9
  class TranslatorEngine
9
- attr_reader :config, :logger, :ai_client, :file_finder
10
+ attr_reader :config, :logger, :ai_client, :file_finder, :document_splitter
10
11
 
11
12
  def initialize(config, logger, ai_client)
12
13
  @config = config
13
14
  @logger = logger
14
15
  @ai_client = ai_client
15
16
  @file_finder = FileFinder.new(config, logger)
17
+ @document_splitter = DocumentSplitter.new(config, logger)
16
18
  end
17
19
 
18
20
  def translate_file(input_path)
@@ -115,7 +117,10 @@ module LlmTranslate
115
117
  end
116
118
 
117
119
  def translate_content(content, file_path = nil)
118
- if config.preserve_formatting?
120
+ # Check whether document splitting should be used
121
+ if config.enable_document_splitting? && content.length > config.max_chars_for_splitting
122
+ translate_with_document_splitting(content, file_path)
123
+ elsif config.preserve_formatting?
119
124
  translate_with_format_preservation(content)
120
125
  else
121
126
  ai_client.translate(content)
@@ -151,5 +156,40 @@ module LlmTranslate
151
156
  # Translate the content with placeholders
152
157
  ai_client.translate(content)
153
158
  end
159
+
160
+ def translate_with_document_splitting(content, file_path = nil)
161
+ logger.info "Document splitting enabled for large content#{file_path ? " from #{file_path}" : ''}"
162
+
163
+ # Split the document
164
+ chunks = document_splitter.split_document(content)
165
+
166
+ logger.info "Translating #{chunks.length} chunks..."
167
+
168
+ # Translate each chunk
169
+ translated_chunks = []
170
+ chunks.each_with_index do |chunk, index|
171
+ logger.info "Translating chunk #{index + 1}/#{chunks.length} (#{chunk.length} chars)..."
172
+
173
+ begin
174
+ translated_chunk = if config.preserve_formatting?
175
+ translate_with_format_preservation(chunk)
176
+ else
177
+ ai_client.translate(chunk)
178
+ end
179
+
180
+ translated_chunks << translated_chunk
181
+
182
+ # Wait for the request interval between chunks
183
+ sleep(config.request_interval) if config.request_interval.positive? && index < chunks.length - 1
184
+ rescue StandardError => e
185
+ logger.error "Failed to translate chunk #{index + 1}: #{e.message}"
186
+ raise e
187
+ end
188
+ end
189
+
190
+ # Merge the translated chunks
191
+ logger.info 'Merging translated chunks...'
192
+ document_splitter.merge_translated_chunks(translated_chunks)
193
+ end
154
194
  end
155
195
  end
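The chunk loop added in `translate_with_document_splitting` translates the chunks in order, sleeps between requests but not after the last one, and re-raises any error after logging it, so a failing chunk aborts the whole document. A minimal sketch of that flow follows, with assumptions stated plainly: `UpcaseClient`, `translate_chunks`, and `merge_chunks` are hypothetical names, the client only upcases text instead of calling an AI provider, and logging and format preservation are omitted.

```ruby
# Hypothetical stand-in for the gem's AI client: "translates" by upcasing.
class UpcaseClient
  def translate(text)
    text.upcase
  end
end

# Translate chunks in order, pausing between requests but not after the last.
# An exception raised by the client propagates and stops the whole run.
def translate_chunks(chunks, client, interval: 0)
  chunks.each_with_index.map do |chunk, index|
    translated = client.translate(chunk)
    sleep(interval) if interval.positive? && index < chunks.length - 1
    translated
  end
end

# Join with a blank line, collapse runs of 3+ newlines, end with one newline.
def merge_chunks(translated)
  joined = translated.join("\n\n")
  cleaned = joined.gsub(/\n{3,}/, "\n\n")
  "#{cleaned.strip}\n"
end

translated = translate_chunks(["# title\n", 'body text'], UpcaseClient.new)
merged = merge_chunks(translated)
puts merged
```

The trailing newline on the first chunk shows why the cleanup step matters: joining would otherwise leave three consecutive newlines between the two chunks.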
data/lib/llm_translate/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module LlmTranslate
4
- VERSION = '0.3.0'
4
+ VERSION = '0.5.0'
5
5
  end
data/llm_translate.yml CHANGED
@@ -50,6 +50,15 @@ translation:
50
50
  # Whether to translate code comments
51
51
  translate_code_comments: false
52
52
 
53
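+ # Document splitting configuration
54
+ # Splitting is enabled automatically when the document exceeds max_chars characters
55
+ enable_splitting: true
56
+
57
+ # Maximum character count before splitting is triggered
58
+ max_chars: 20000
59
+
60
+ # Target character count per chunk
61
+ every_chars: 18000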
53
62
 
54
63
 
55
64
  # File handling configuration
@@ -125,13 +134,13 @@ error_handling:
125
134
 
126
135
  # Performance configuration
127
136
  performance:
128
- # Number of files processed concurrently
137
+ # Number of files processed concurrently (1 is recommended when document splitting is used)
129
138
  concurrent_files: 3
130
139
 
131
140
  # Batch size (number of files translated at once)
132
141
  batch_size: 5
133
142
 
134
- # Request interval (avoids API rate limiting)
143
+ # Request interval (avoids API rate limiting; especially important when splitting documents)
135
144
  request_interval: 1 # seconds
136
145
 
137
146
  # Memory usage limit
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: llm_translate
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.3.0
4
+ version: 0.5.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - LlmTranslate Team
8
8
  autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2025-09-01 00:00:00.000000000 Z
11
+ date: 2025-09-02 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: async
@@ -103,6 +103,7 @@ extensions: []
103
103
  extra_rdoc_files: []
104
104
  files:
105
105
  - ".rspec_status"
106
+ - DOCUMENT_SPLITTER_SUMMARY.md
106
107
  - README.md
107
108
  - README.zh.md
108
109
  - Rakefile
@@ -112,21 +113,20 @@ files:
112
113
  - content/prompt.md
113
114
  - content/todo.md
114
115
  - exe/llm_translate
116
+ - large_document_config.yml
115
117
  - lib/llm_translate.rb
116
118
  - lib/llm_translate/ai_client.rb
117
119
  - lib/llm_translate/cli.rb
118
120
  - lib/llm_translate/config.rb
121
+ - lib/llm_translate/document_splitter.rb
119
122
  - lib/llm_translate/file_finder.rb
120
123
  - lib/llm_translate/logger.rb
121
124
  - lib/llm_translate/translator_engine.rb
122
125
  - lib/llm_translate/version.rb
123
126
  - llm_translate.gemspec
124
127
  - llm_translate.yml
125
- - test_config.yml
126
128
  - test_docs/sample.md
127
129
  - test_docs_translated/sample.zh.md
128
- - test_llm_translate.yml
129
- - test_new_config.yml
130
130
  homepage: https://github.com/tianlu1677/llm_translate
131
131
  licenses:
132
132
  - MIT
data/test_config.yml DELETED
@@ -1,52 +0,0 @@
1
- # Test llm_translate configuration
2
- ai:
3
- api_key: ${LLM_TRANSLATE_API_KEY}
4
- provider: "openai"
5
- model: "gpt-4"
6
- temperature: 0.3
7
- max_tokens: 4000
8
- retry_attempts: 3
9
- retry_delay: 2
10
- timeout: 60
11
-
12
- translation:
13
- target_language: "zh-CN"
14
- default_prompt: |
15
- Please translate the following Markdown content to Chinese, keeping all formatting intact:
16
- - Preserve code blocks, links, images, and other Markdown syntax
17
- - Keep English technical terms and product names
18
- - Ensure natural and fluent translation
19
-
20
- Content:
21
- {content}
22
-
23
- files:
24
- input_directory: "./test_docs"
25
- output_directory: "./test_docs_translated"
26
- filename_suffix: ".zh"
27
- include_patterns:
28
- - "**/*.md"
29
- - "**/*.markdown"
30
- exclude_patterns: []
31
- preserve_directory_structure: true
32
- overwrite_policy: "overwrite"
33
-
34
- logging:
35
- level: "info"
36
- output: "console"
37
- verbose_translation: true
38
-
39
- error_handling:
40
- on_error: "log_and_continue"
41
- max_consecutive_errors: 5
42
- retry_on_failure: 2
43
- generate_error_report: true
44
-
45
- performance:
46
- concurrent_files: 1
47
- request_interval: 1
48
-
49
- output:
50
- show_progress: true
51
- show_statistics: true
52
- generate_report: true
data/test_llm_translate.yml DELETED
@@ -1,176 +0,0 @@
1
- # translator.yml - 翻译工具配置文件
2
-
3
- # AI 模型配置
4
- ai:
5
- # API 密钥(建议使用环境变量 LLM_TRANSLATE_API_KEY)
6
- api_key: ${LLM_TRANSLATE_API_KEY}
7
-
8
- # 模型提供商(openai, anthropic, ollama 等)
9
- provider: "openai"
10
-
11
- # 模型名称
12
- model: "gpt-4"
13
-
14
- # 模型参数
15
- temperature: 0.3
16
- max_tokens: 4000
17
- top_p: 1.0
18
-
19
- # 请求重试配置
20
- retry_attempts: 3
21
- retry_delay: 2 # 秒
22
-
23
- # 请求超时时间
24
- timeout: 60 # 秒
25
-
26
- # 翻译配置
27
- translation:
28
- # 默认翻译 prompt
29
- default_prompt: |
30
- 请将以下 Markdown 内容翻译为中文,保持所有格式不变:
31
- - 保留代码块、链接、图片等 Markdown 语法
32
- - 保留英文的专业术语和产品名称
33
- - 确保翻译自然流畅
34
-
35
- 内容:
36
- {content}
37
-
38
- # 目标语言
39
- target_language: "zh-CN"
40
-
41
- # 源语言(auto 为自动检测)
42
- source_language: "auto"
43
-
44
- # 是否保留原文格式
45
- preserve_formatting: true
46
-
47
- # 是否翻译代码注释
48
- translate_code_comments: false
49
-
50
- # 需要保留不翻译的内容模式
51
-
52
-
53
- # 文件处理配置
54
- files:
55
- # 输入目录
56
- input_directory: "./docs"
57
-
58
- # 输出目录
59
- output_directory: "./docs-translated"
60
-
61
- # 文件名后缀策略
62
- filename_strategy: "suffix" # suffix, replace, directory
63
- filename_suffix: ".zh" # 仅当 strategy 为 suffix 时使用
64
-
65
- # 包含的文件模式
66
- include_patterns:
67
- - "**/*.md"
68
- - "**/*.markdown"
69
-
70
- # 排除的文件模式
71
- exclude_patterns:
72
- - "**/node_modules/**"
73
- - "**/.*"
74
- - "**/*.tmp"
75
- - "**/README.md" # 示例:排除 README 文件
76
-
77
- # 是否保持目录结构
78
- preserve_directory_structure: true
79
-
80
- # 文件覆盖策略
81
- overwrite_policy: "ask" # ask, overwrite, skip, backup
82
-
83
- # 备份目录(当 overwrite_policy 为 backup 时)
84
- backup_directory: "./backups"
85
-
86
- # 日志配置
87
- logging:
88
- # 日志级别
89
- level: "info" # debug, info, warn, error
90
-
91
- # 日志输出位置
92
- output: "console" # console, file, both
93
-
94
- # 日志文件路径(当 output 包含 file 时)
95
- file_path: "./logs/translator.log"
96
-
97
- # 是否记录详细的翻译过程
98
- verbose_translation: false
99
-
100
- # 错误日志文件
101
- error_log_path: "./logs/errors.log"
102
-
103
- # 错误处理配置
104
- error_handling:
105
- # 遇到错误时的行为
106
- on_error: "log_and_continue" # stop, log_and_continue, skip_file
107
-
108
- # 最大连续错误数(超过则停止)
109
- max_consecutive_errors: 5
110
-
111
- # 错误重试次数
112
- retry_on_failure: 2
113
-
114
- # 生成错误报告
115
- generate_error_report: true
116
- error_report_path: "./logs/error_report.md"
117
-
118
- # 性能配置
119
- performance:
120
- # 并发处理文件数
121
- concurrent_files: 3
122
-
123
- # 批处理大小(同时翻译的文件数)
124
- batch_size: 5
125
-
126
- # 请求间隔(避免 API 限流)
127
- request_interval: 1 # 秒
128
-
129
- # 内存使用限制
130
- max_memory_mb: 500
131
-
132
- # 输出配置
133
- output:
134
- # 是否显示进度条
135
- show_progress: true
136
-
137
- # 是否显示翻译统计
138
- show_statistics: true
139
-
140
- # 是否生成翻译报告
141
- generate_report: true
142
- report_path: "./reports/translation_report.md"
143
-
144
- # 输出格式
145
- format: "markdown" # markdown, json, yaml
146
-
147
- # 是否保留元数据
148
- include_metadata: true
149
-
150
- # 预设配置(可通过 --preset 参数使用)
151
- presets:
152
- chinese:
153
- translation:
154
- target_language: "zh-CN"
155
- default_prompt: "翻译为简体中文,保持技术术语的准确性"
156
-
157
- japanese:
158
- translation:
159
- target_language: "ja"
160
- default_prompt: "日本語に翻訳してください。技術用語は正確に保ってください"
161
-
162
- english:
163
- translation:
164
- target_language: "en"
165
- default_prompt: "Translate to English, maintaining technical accuracy"
166
-
167
- # 自定义 Hook(高级功能)
168
- hooks:
169
- # 翻译前处理
170
- pre_translation: null
171
-
172
- # 翻译后处理
173
- post_translation: null
174
-
175
- # 文件处理完成后
176
- post_file_processing: null
data/test_new_config.yml DELETED
@@ -1,184 +0,0 @@
1
- # llm_translate.yml - 翻译工具配置文件
2
-
3
- # AI 模型配置
4
- ai:
5
- # API 密钥
6
- api_key: xxxx
7
-
8
- # API 主机地址
9
- host: https://aihubmix.com
10
-
11
- # 模型提供商
12
- provider: "claude"
13
-
14
- # 模型名称
15
- model: "claude-3-7-sonnet-20250219"
16
-
17
- # 模型参数
18
- temperature: 0.3
19
- max_tokens: 4000
20
- top_p: 1.0
21
-
22
- # 请求重试配置
23
- retry_attempts: 3
24
- retry_delay: 2 # 秒
25
-
26
- # 请求超时时间
27
- timeout: 60 # 秒
28
-
29
- # 翻译配置
30
- translation:
31
- # 默认翻译 prompt
32
- default_prompt: |
33
- 请将以下 Markdown 内容翻译为中文,保持所有格式不变:
34
- - 保留代码块、链接、图片等 Markdown 语法
35
- - 保留英文的专业术语和产品名称
36
- - 确保翻译自然流畅
37
-
38
- 内容:
39
- {content}
40
-
41
- # 目标语言
42
- target_language: "zh-CN"
43
-
44
- # 源语言(auto 为自动检测)
45
- source_language: "auto"
46
-
47
- # 是否保留原文格式
48
- preserve_formatting: true
49
-
50
- # 是否翻译代码注释
51
- translate_code_comments: false
52
-
53
-
54
-
55
- # 文件处理配置
56
- files:
57
- # 输入目录
58
- input_directory: "./docs"
59
-
60
- # 输出目录
61
- output_directory: "./docs-translated"
62
-
63
- # 输入文件
64
- input_file: "./README.md"
65
-
66
- # 输出文件
67
- output_file: "./README.zh.md"
68
-
69
- # 文件名后缀策略
70
- filename_strategy: "suffix" # suffix, replace, directory
71
- filename_suffix: ".zh" # 仅当 strategy 为 suffix 时使用
72
-
73
- # 包含的文件模式
74
- include_patterns:
75
- - "**/*.md"
76
- - "**/*.markdown"
77
-
78
- # 排除的文件模式
79
- exclude_patterns:
80
- - "**/node_modules/**"
81
- - "**/.*"
82
- - "**/*.tmp"
83
- - "**/README.md" # 示例:排除 README 文件
84
-
85
- # 是否保持目录结构
86
- preserve_directory_structure: true
87
-
88
- # 文件覆盖策略
89
- overwrite_policy: "ask" # ask, overwrite, skip, backup
90
-
91
- # 备份目录(当 overwrite_policy 为 backup 时)
92
- backup_directory: "./backups"
93
-
94
- # 日志配置
95
- logging:
96
- # 日志级别
97
- level: "info" # debug, info, warn, error
98
-
99
- # 日志输出位置
100
- output: "console" # console, file, both
101
-
102
- # 日志文件路径(当 output 包含 file 时)
103
- file_path: "./logs/llm_translate.log"
104
-
105
- # 是否记录详细的翻译过程
106
- verbose_translation: false
107
-
108
- # 错误日志文件
109
- error_log_path: "./logs/errors.log"
110
-
111
- # 错误处理配置
112
- error_handling:
113
- # 遇到错误时的行为
114
- on_error: "log_and_continue" # stop, log_and_continue, skip_file
115
-
116
- # 最大连续错误数(超过则停止)
117
- max_consecutive_errors: 5
118
-
119
- # 错误重试次数
120
- retry_on_failure: 2
121
-
122
- # 生成错误报告
123
- generate_error_report: true
124
- error_report_path: "./logs/error_report.md"
125
-
126
- # 性能配置
127
- performance:
128
- # 并发处理文件数
129
- concurrent_files: 3
130
-
131
- # 批处理大小(同时翻译的文件数)
132
- batch_size: 5
133
-
134
- # 请求间隔(避免 API 限流)
135
- request_interval: 1 # 秒
136
-
137
- # 内存使用限制
138
- max_memory_mb: 500
139
-
140
- # 输出配置
141
- output:
142
- # 是否显示进度条
143
- show_progress: true
144
-
145
- # 是否显示翻译统计
146
- show_statistics: true
147
-
148
- # 是否生成翻译报告
149
- generate_report: true
150
- report_path: "./reports/translation_report.md"
151
-
152
- # 输出格式
153
- format: "markdown" # markdown, json, yaml
154
-
155
- # 是否保留元数据
156
- include_metadata: true
157
-
158
- # 预设配置(可通过 --preset 参数使用)
159
- presets:
160
- chinese:
161
- translation:
162
- target_language: "zh-CN"
163
- default_prompt: "翻译为简体中文,保持技术术语的准确性"
164
-
165
- japanese:
166
- translation:
167
- target_language: "ja"
168
- default_prompt: "日本語に翻訳してください。技術用語は正確に保ってください"
169
-
170
- english:
171
- translation:
172
- target_language: "en"
173
- default_prompt: "Translate to English, maintaining technical accuracy"
174
-
175
- # 自定义 Hook(高级功能)
176
- hooks:
177
- # 翻译前处理
178
- pre_translation: null
179
-
180
- # 翻译后处理
181
- post_translation: null
182
-
183
- # 文件处理完成后
184
- post_file_processing: null