star-dlp 0.1.0 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 0f53a06472e77562560428a8f65cce8a670f26d0d430cd2c947dc8843cd5c697
- data.tar.gz: 5e9ca4e9a2fc62a854fc1f8b2f8473c5379f8e03c6f8772e9f4eeb1055cbc4dd
+ metadata.gz: 49aecf46afd8779a951f317d8412ae41d157bcb50d6df163db0eec69f556881b
+ data.tar.gz: 4f3b4809beb3fddc5508f2f2e55cd62012829ff6f6b142b28034eee4e2aaedc0
  SHA512:
- metadata.gz: 290b940570744dc5ed74fbdfeda58794312109c5412a7751b15d40d3eb857418c676414e0ee692a1db31bd2458ff566e1e4437a41786102386563df065cd4b59
- data.tar.gz: f186110f59e875bd99aa586544ad8a2d69243172d45ee4b76bfa458e373808d9dd6ee8fa9e952251c06e9264a009bf6b3408556334445ed88c4cd4dded8ac8c5
+ metadata.gz: 824c986da6d7c0e30f058bec67b254d289024da4f7effdfbf1975af7d2a5414671e551d561df110331d5cdcff2a3c8f3427028600ade35d86fe3958264df370a
+ data.tar.gz: 18278facb4fd629b173af6f78f9a3c0b10c8ec2cd035590031f4a7fbe30d9a0d016d39fc2c6c2cfc3d285a5b52ac2cd41f6ea250a65a5e0e2589e3b54e213c1a
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- star-dlp (0.1.0)
+ star-dlp (0.1.2)
  fileutils (~> 1.6)
  github_api (~> 0.19.0)
  json (~> 2.6)
data/README.md CHANGED
@@ -48,6 +48,44 @@ $ star-dlp download your_github_username
 
  This will download all your starred repositories and save them as JSON and Markdown files. If you've previously downloaded some repositories, it will only download newly starred repositories.
 
+ Available options:
+ - `--token`: GitHub API token
+ - `--output_dir`: Output directory
+ - `--json_dir`: JSON files directory
+ - `--markdown_dir`: Markdown files directory
+ - `--threads`: Number of download threads (default: 16)
+ - `--skip_readme`: Skip downloading README files
+ - `--retry_count`: Number of retry attempts for failed downloads (default: 5)
+ - `--retry_delay`: Delay in seconds between retry attempts (default: 1)
+
+ Example with options:
+
+ ```bash
+ $ star-dlp download your_github_username --threads=8 --skip_readme --retry_count=3
+ ```
+
+ ### Downloading READMEs
+
+ If you've already downloaded your starred repositories but want to download or update their README files separately:
+
+ ```bash
+ $ star-dlp download_readme
+ ```
+
+ This command will scan your JSON files directory, extract repository information, and download README files for repositories that don't already have them.
+
+ Available options:
+ - `--threads`: Number of download threads (default: 16)
+ - `--retry_count`: Number of retry attempts for failed downloads (default: 5)
+ - `--retry_delay`: Delay in seconds between retry attempts (default: 1)
+ - `--force`: Force download even if README was already downloaded
+
+ Example with options:
+
+ ```bash
+ $ star-dlp download_readme --threads=8 --force
+ ```
+
  ### View Version
 
  ```bash
@@ -60,8 +98,10 @@ Star-DLP saves files in the following locations:
 
  - Configuration file: `~/.star-dlp/config.json`
  - Starred repositories: `~/.star-dlp/stars/`
- - JSON files: `~/.star-dlp/stars/json/`
- - Markdown files: `~/.star-dlp/stars/markdown/`
+ - JSON files: `~/.star-dlp/stars/json/YYYY/MM/YYYYMMDD.owner.repo.json`
+ - Markdown files: `~/.star-dlp/stars/markdown/YYYY/MM/YYYYMMDD.owner.repo.md`
+ - Last downloaded repository: `~/.star-dlp/stars/last_downloaded_repo.txt`
+ - Downloaded READMEs list: `~/.star-dlp/stars/downloaded_readmes.txt`
 
  ## Development
 
data/README_zh.md CHANGED
@@ -48,6 +48,44 @@ $ star-dlp download your_github_username
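 
  This will download all your starred repositories and save them as JSON and Markdown files. If you've previously downloaded some repositories, it will only download newly starred repositories.
 
+ Available options:
+ - `--token`: GitHub API token
+ - `--output_dir`: Output directory
+ - `--json_dir`: JSON files directory
+ - `--markdown_dir`: Markdown files directory
+ - `--threads`: Number of download threads (default: 16)
+ - `--skip_readme`: Skip downloading README files
+ - `--retry_count`: Number of retry attempts when a download fails (default: 5)
+ - `--retry_delay`: Delay in seconds between retries (default: 1)
+
+ Example with options:
+
+ ```bash
+ $ star-dlp download your_github_username --threads=8 --skip_readme --retry_count=3
+ ```
+
+ ### Downloading README files
+
+ If you have already downloaded your starred repositories but want to download or update their README files separately:
+
+ ```bash
+ $ star-dlp download_readme
+ ```
+
+ This command scans your JSON files directory, extracts repository information, and downloads README files for repositories whose README has not been downloaded yet.
+
+ Available options:
+ - `--threads`: Number of download threads (default: 16)
+ - `--retry_count`: Number of retry attempts when a download fails (default: 5)
+ - `--retry_delay`: Delay in seconds between retries (default: 1)
+ - `--force`: Force the download even if the README was already downloaded
+
+ Example with options:
+
+ ```bash
+ $ star-dlp download_readme --threads=8 --force
+ ```
+
  ### View Version
 
  ```bash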
@@ -60,8 +98,10 @@ Star-DLP saves files in the following locations:
 
  - Configuration file: `~/.star-dlp/config.json`
  - Starred repositories: `~/.star-dlp/stars/`
- - JSON files: `~/.star-dlp/stars/json/`
- - Markdown files: `~/.star-dlp/stars/markdown/`
+ - JSON files: `~/.star-dlp/stars/json/YYYY/MM/YYYYMMDD.owner.repo.json`
+ - Markdown files: `~/.star-dlp/stars/markdown/YYYY/MM/YYYYMMDD.owner.repo.md`
+ - Last downloaded repository: `~/.star-dlp/stars/last_downloaded_repo.txt`
+ - Downloaded READMEs list: `~/.star-dlp/stars/downloaded_readmes.txt`
 
  ## Development
 
data/lib/star/dlp/cli.rb CHANGED
@@ -1,6 +1,9 @@
  # frozen_string_literal: true
 
  require "thor"
+ require "fileutils"
+ require "json"
+ require "time"
  require_relative "config"
  require_relative "downloader"
 
 
@@ -12,6 +15,10 @@ module Star
  option :output_dir, type: :string, desc: "Output directory for stars"
  option :json_dir, type: :string, desc: "Directory for JSON files"
  option :markdown_dir, type: :string, desc: "Directory for Markdown files"
+ option :threads, type: :numeric, default: 16, desc: "Number of download threads"
+ option :skip_readme, type: :boolean, default: false, desc: "Skip downloading README files"
+ option :retry_count, type: :numeric, default: 5, desc: "Number of retry attempts for failed downloads"
+ option :retry_delay, type: :numeric, default: 1, desc: "Delay in seconds between retry attempts"
  def download(username)
  config = Config.load
 
@@ -24,10 +31,42 @@ module Star
  # Save config for future use
  config.save
 
- downloader = Downloader.new(config, username)
+ downloader = Downloader.new(
+ config,
+ username,
+ thread_count: options[:threads],
+ skip_readme: options[:skip_readme],
+ retry_count: options[:retry_count],
+ retry_delay: options[:retry_delay]
+ )
  downloader.download
  end
 
+ desc "download_readme", "Download READMEs for all repositories from JSON files"
+ option :threads, type: :numeric, default: 16, desc: "Number of download threads"
+ option :retry_count, type: :numeric, default: 5, desc: "Number of retry attempts for failed downloads"
+ option :retry_delay, type: :numeric, default: 1, desc: "Delay in seconds between retry attempts"
+ option :force, type: :boolean, default: false, desc: "Force download even if README was already downloaded"
+ def download_readme
+ config = Config.load
+
+ # Create a downloader instance
+ downloader = Downloader.new(
+ config,
+ "readme_downloader", # Placeholder username
+ thread_count: options[:threads],
+ retry_count: options[:retry_count],
+ retry_delay: options[:retry_delay]
+ )
+
+ # Call the download_readmes method in the Downloader class
+ result = downloader.download_readmes(force: options[:force])
+
+ puts "README download completed!"
+ puts "Successfully downloaded: #{result[:success]}"
+ puts "Failed or not found: #{result[:failed]}"
+ end
+
  desc "config", "Configure star-dlp"
  option :token, type: :string, desc: "GitHub API token"
  option :output_dir, type: :string, desc: "Output directory for stars"
data/lib/star/dlp/downloader.rb CHANGED
@@ -2,9 +2,12 @@
 
  require "github_api"
  require "json"
+ require "tempfile"
  require "fileutils"
  require "time"
  require "base64"
+ require "thread"
+ require "open3"
 
  module Star
  module Dlp
@@ -12,10 +15,45 @@ module Star
  attr_reader :config, :github, :username
 
  LAST_REPO_FILE = "last_downloaded_repo.txt"
+ DOWNLOADED_READMES_FILE = "downloaded_readmes.txt"
+ DEFAULT_THREAD_COUNT = 16
+ DEFAULT_RETRY_COUNT = 5
+ DEFAULT_RETRY_DELAY = 1 # seconds
 
- def initialize(config, username)
+ # Supported README formats in order of preference
+ README_FORMATS = [
+ "README.md",
+ "README.markdown",
+ "readme.md",
+ "README.org",
+ "README.rst",
+ "README.txt",
+ "README.rdoc",
+ "README.adoc",
+ "README",
+ "readme.org",
+ "readme.rst",
+ "readme.txt",
+ "readme.rdoc",
+ "readme.adoc",
+ "readme"
+ ]
+
+ # Formats that need conversion to markdown
+ FORMATS_NEEDING_CONVERSION = {
+ ".org" => "org",
+ ".rst" => "rst",
+ ".txt" => "txt",
+ "" => "txt" # For files without extension
+ }
+
+ def initialize(config, username, thread_count: DEFAULT_THREAD_COUNT, skip_readme: false, retry_count: DEFAULT_RETRY_COUNT, retry_delay: DEFAULT_RETRY_DELAY)
  @config = config
  @username = username
+ @thread_count = thread_count
+ @skip_readme = skip_readme
+ @retry_count = retry_count
+ @retry_delay = retry_delay
 
  # Initialize GitHub API client with the special Accept header for starred_at field
  options = {
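Review note: the hunk above pairs a fixed search order (`README_FORMATS`) with an extension lookup (`FORMATS_NEEDING_CONVERSION`). The following is a minimal sketch of how that lookup behaves, assuming the same table as in the diff; `conversion_format` is a hypothetical helper name and is not part of the gem.

```ruby
# Hypothetical helper illustrating the extension lookup. The table mirrors
# FORMATS_NEEDING_CONVERSION from the diff.
FORMATS_NEEDING_CONVERSION = {
  ".org" => "org",
  ".rst" => "rst",
  ".txt" => "txt",
  "" => "txt" # files without an extension
}.freeze

# Returns the source format to convert from, or nil when the table has
# no entry for the file's extension (for example Markdown files).
def conversion_format(readme_path)
  ext = File.extname(readme_path).downcase
  FORMATS_NEEDING_CONVERSION[ext]
end

puts conversion_format("README.rst").inspect
puts conversion_format("README").inspect
puts conversion_format("README.md").inspect
```

Because `.rdoc` and `.adoc` have no entry in the table, the lookup returns nil for them as well, so in the first loop of `fetch_readme` they take the same no-conversion branch as Markdown files.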
@@ -98,14 +136,19 @@ module Star
 
  puts "Found #{new_stars.size} new starred repositories to download"
 
- # Save new stars
+ # Save new stars using multiple threads
  if new_stars.any?
- puts "Downloading new repositories:"
- new_stars.each_with_index do |star, index|
- puts " [#{index + 1}/#{new_stars.size}] Downloading: #{get_repo_full_name(star)}"
- save_star_as_json(star)
- save_star_as_markdown(star)
- end
+ puts "Downloading new repositories using #{@thread_count} threads:"
+
+ # Process stars with multithreading
+ process_items_with_threads(
+ new_stars,
+ ->(star) { get_repo_full_name(star) },
+ ->(star) {
+ save_star_as_json(star)
+ save_star_as_markdown(star)
+ }
+ )
 
  puts "Download completed successfully!"
  else
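Review note: this hunk replaces the sequential loop with a call to `process_items_with_threads`, a queue-backed worker pool with per-item retries. The sketch below shows the general pattern only and is not the gem's implementation: `run_pool` and its keyword names are hypothetical, and it drains the queue with `Queue#close` plus a blocking `pop` rather than the `until queue.empty?` loop with a non-blocking `pop` used in the diff.

```ruby
# Sketch of a queue-backed worker pool with per-item retries.
# `run_pool` and its keyword names are hypothetical.
def run_pool(items, thread_count: 4, retry_count: 3, retry_delay: 0)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # once drained, pop returns nil instead of blocking

  mutex = Mutex.new
  results = { completed: 0, failed: 0 }

  workers = Array.new(thread_count) do
    Thread.new do
      # Items must be truthy: a nil or false item would end the loop early.
      while (item = queue.pop)
        attempts = 0
        begin
          attempts += 1
          yield item
          mutex.synchronize { results[:completed] += 1 }
        rescue StandardError
          if attempts < retry_count
            sleep(retry_delay)
            retry
          end
          mutex.synchronize { results[:failed] += 1 }
        end
      end
    end
  end

  workers.each(&:join)
  results
end

# Usage: "b" fails once and then succeeds, "c" fails on every attempt.
seen = Hash.new(0)
lock = Mutex.new
summary = run_pool(%w[a b c], thread_count: 2, retry_count: 2) do |item|
  first_try = lock.synchronize { (seen[item] += 1) == 1 }
  raise "transient failure" if item == "b" && first_try
  raise "permanent failure" if item == "c"
end
p summary
```

Closing the queue before starting the workers means no thread can observe a non-empty queue and then lose the race for the last item, which is the case the diff handles with `rescue nil` on the non-blocking `pop`.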
@@ -119,8 +162,421 @@ module Star
  end
  end
 
+ # Download READMEs for all repositories from JSON files
+ def download_readmes(force: false)
+ puts "Downloading READMEs for repositories from JSON files"
+
+ # File to track repositories with downloaded READMEs
+ downloaded_readmes_file = File.join(config.output_dir, DOWNLOADED_READMES_FILE)
+
+ # Load list of repositories with already downloaded READMEs
+ downloaded_repos = Set.new
+ if File.exist?(downloaded_readmes_file) && !force
+ File.readlines(downloaded_readmes_file).each do |line|
+ downloaded_repos.add(line.strip)
+ end
+ puts "Found #{downloaded_repos.size} repositories with already downloaded READMEs"
+ end
+
+ # Find all JSON files in the json directory
+ json_files = Dir.glob(File.join(config.json_dir, "**", "*.json"))
+ puts "Found #{json_files.size} JSON files"
+
+ # Extract repository names from JSON files
+ repos_to_process = []
+ repo_dates = {} # Store starred_at dates for repositories
+
+ json_files.each do |json_file|
+ begin
+ data = JSON.parse(File.read(json_file))
+
+ # Extract repository full name from JSON data
+ repo_full_name = nil
+ starred_at = nil
+
+ if data.is_a?(Hash) && data["repo"] && data["repo"]["full_name"]
+ repo_full_name = data["repo"]["full_name"]
+ starred_at = data["starred_at"] if data.key?("starred_at")
+ elsif data.is_a?(Hash) && data["full_name"]
+ repo_full_name = data["full_name"]
+ starred_at = data["starred_at"] if data.key?("starred_at")
+ elsif File.basename(json_file) =~ /(\d{8})\.(.+)\.json$/
+ # Try to extract from filename (format: YYYYMMDD.owner.repo.json)
+ date_str = $1
+ parts = $2.split('.')
+ if parts.size >= 2
+ repo_full_name = "#{parts[0]}/#{parts[1]}"
+ # Convert YYYYMMDD to ISO date format
+ if date_str =~ /^(\d{4})(\d{2})(\d{2})$/
+ starred_at = "#{$1}-#{$2}-#{$3}T00:00:00Z"
+ end
+ end
+ end
+
+ # Skip if we couldn't determine the repository name or if README was already downloaded
+ next if repo_full_name.nil?
+ next if downloaded_repos.include?(repo_full_name) && !force
+
+ repos_to_process << repo_full_name
+ # Store the starred_at date if available
+ repo_dates[repo_full_name] = starred_at if starred_at
+ rescue JSON::ParserError => e
+ puts "Error parsing JSON file #{json_file}: #{e.message}"
+ end
+ end
+
+ puts "Found #{repos_to_process.size} repositories that need README downloads"
+
+ # Create a mutex for thread-safe file writing
+ mutex = Mutex.new
+ success_count = 0
+ failed_count = 0
+
+ # Process repositories with multithreading
+ result = process_items_with_threads(
+ repos_to_process,
+ ->(repo) { repo }, # Item name is the repo name itself
+ ->(repo_full_name) {
+ # Try to download README
+ readme_result = fetch_readme(repo_full_name)
+
+ if readme_result && readme_result[:content]
+ # Get starred_at date if available, or use current date as fallback
+ date = nil
+ if repo_dates.key?(repo_full_name) && repo_dates[repo_full_name]
+ begin
+ date = Time.parse(repo_dates[repo_full_name])
+ rescue
+ date = Time.now
+ end
+ else
+ date = Time.now
+ end
+
+ # Create markdown file path
+ md_filepath = get_markdown_filepath(repo_full_name, date)
+
+ mutex.synchronize do
+ # Check if file exists
+ if File.exist?(md_filepath)
+ # Append README content to existing file
+ File.open(md_filepath, 'a') do |file|
+ file.puts "\n\n## README"
+ file.puts "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+ file.puts "\n#{readme_result[:content]}\n"
+ end
+ else
+ # Create new file with repository information and README
+ content = <<~MARKDOWN
+ # #{repo_full_name}
+
+ - **Downloaded at**: #{Time.now.iso8601}
+ - **Starred at**: #{date.iso8601}
+
+ [View on GitHub](https://github.com/#{repo_full_name})
+
+ ## README
+ MARKDOWN
+
+ # Add format note if not markdown
+ content += "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+
+ # Add README content
+ content += "\n#{readme_result[:content]}\n"
+
+ File.write(md_filepath, content)
+ end
+
+ # Add to downloaded repositories list
+ File.open(downloaded_readmes_file, 'a') do |file|
+ file.puts repo_full_name
+ end
+
+ success_count += 1
+ end
+
+ true
+ else
+ mutex.synchronize do
+ puts "No README found for #{repo_full_name}"
+ failed_count += 1
+ end
+ true # Mark as success even if README not found to avoid retries
+ end
+ }
+ )
+
+ puts "README download completed!"
+ puts "Successfully downloaded: #{success_count}"
+ puts "Failed or not found: #{failed_count}"
+
+ return {
+ total: repos_to_process.size,
+ success: success_count,
+ failed: failed_count
+ }
+ end
+
+ # Fetch README content from GitHub
+ # Returns a hash with :content and :format keys, or nil if not found
+ def fetch_readme(repo_full_name)
+ # Try each README format in order
+ README_FORMATS.each do |readme_path|
+ begin
+ # Get README content using GitHub API
+ response = github.repos.contents.get(
+ user: repo_full_name.split('/').first,
+ repo: repo_full_name.split('/').last,
+ path: readme_path
+ )
+
+ # Decode content from Base64
+ if response.content && response.encoding == 'base64'
+ content = Base64.decode64(response.content).force_encoding('UTF-8')
+
+ # Get file extension
+ ext = File.extname(readme_path).downcase
+
+ # Check if we need to convert the content
+ if FORMATS_NEEDING_CONVERSION.key?(ext)
+ format = FORMATS_NEEDING_CONVERSION[ext]
+ puts "Converting #{readme_path} from #{format} to markdown for #{repo_full_name}"
+
+ # Create a temporary file with the content
+ temp_file = Tempfile.new(['readme', ".#{format}"])
+ begin
+ temp_file.write(content)
+ temp_file.close
+
+ # Use pandoc to convert to markdown
+ markdown_content, status = convert_to_markdown(temp_file.path, format)
+
+ if status.success?
+ return { content: markdown_content, format: format }
+ else
+ puts "Pandoc conversion failed for #{repo_full_name}, using original content"
+ return { content: content, format: format }
+ end
+ ensure
+ temp_file.unlink
+ end
+ else
+ # Already markdown, no conversion needed
+ return { content: content, format: "markdown" }
+ end
+ end
+ rescue Github::Error::NotFound
+ # Try next format
+ next
+ rescue => e
+ puts "Error fetching #{readme_path} for #{repo_full_name}: #{e.message}"
+ next
+ end
+ end
+
+ # No README found in predefined formats, check for any readme-like file in the root directory
+ begin
+ # Get repository contents
+ contents = github.repos.contents.get(
+ user: repo_full_name.split('/').first,
+ repo: repo_full_name.split('/').last,
+ path: "" # Root directory
+ )
+
+ # Look for any file with name matching /readme/i
+ readme_file = contents.find { |item| item.type == "file" && item.name =~ /readme/i }
+
+ if readme_file
+ puts "Found alternative README file: #{readme_file.name} for #{repo_full_name}"
+
+ # Get README content
+ readme_content = github.repos.contents.get(
+ user: repo_full_name.split('/').first,
+ repo: repo_full_name.split('/').last,
+ path: readme_file.name
+ )
+
+ # Decode content from Base64
+ if readme_content.content && readme_content.encoding == 'base64'
+ content = Base64.decode64(readme_content.content).force_encoding('UTF-8')
+
+ # Get file extension
+ ext = File.extname(readme_file.name).downcase
+
+ # Check if we need to convert the content
+ if FORMATS_NEEDING_CONVERSION.key?(ext)
+ format = FORMATS_NEEDING_CONVERSION[ext]
+ puts "Converting #{readme_file.name} from #{format} to markdown for #{repo_full_name}"
+
+ # Create a temporary file with the content
+ temp_file = Tempfile.new(['readme', ".#{format}"])
+ begin
+ temp_file.write(content)
+ temp_file.close
+
+ # Use pandoc to convert to markdown
+ markdown_content, status = convert_to_markdown(temp_file.path, format)
+
+ if status.success?
+ return { content: markdown_content, format: format }
+ else
+ puts "Pandoc conversion failed for #{repo_full_name}, using original content"
+ return { content: content, format: format }
+ end
+ ensure
+ temp_file.unlink
+ end
+ else
+ # Determine format based on extension or default to txt
+ format = ext.empty? ? "txt" : ext[1..]
+ # Use markdown format if extension suggests it's already markdown
+ format = "markdown" if [".md", ".markdown"].include?(ext)
+
+ return { content: content, format: format }
+ end
+ end
+ end
+ rescue => e
+ puts "Error checking root directory for README-like files for #{repo_full_name}: #{e.message}"
+ end
+
+ # No README found in any format
+ nil
+ end
+
+ # Convert content from a given format to markdown using pandoc
+ def convert_to_markdown(file_path, format)
+ begin
+ # Check if pandoc is installed
+ version_output, status = Open3.capture2e("pandoc --version")
+ unless status.success?
+ puts "Warning: pandoc is not installed or not in PATH. Cannot convert non-markdown formats."
+ return [File.read(file_path), status]
+ end
+
+ # Use pandoc to convert to markdown
+ output, status = Open3.capture2e("pandoc", "-f", format, "-t", "markdown", file_path)
+
+ if status.success?
+ return [output, status]
+ else
+ puts "Pandoc conversion failed: #{output}"
+ return [File.read(file_path), status]
+ end
+ rescue => e
+ puts "Error during conversion: #{e.message}"
+ return [File.read(file_path), OpenStruct.new(success?: false)]
+ end
+ end
+
  private
 
+ # Process a list of items using multiple threads
+ # items: Array of items to process
+ # name_proc: Proc to get item name for logging
+ # process_proc: Proc to process each item
+ def process_items_with_threads(items, name_proc, process_proc)
+ return if items.empty?
+
+ # Create a thread-safe queue for the items
+ queue = Queue.new
+ items.each { |item| queue << item }
+
+ # Create a mutex for thread-safe output
+ mutex = Mutex.new
+
+ # Create a progress counter
+ total = items.size
+ completed = 0
+
+ # Create and start the worker threads
+ threads = Array.new(@thread_count) do
+ Thread.new do
+ until queue.empty?
+ # Try to get an item from the queue (non-blocking)
+ item = queue.pop(true) rescue nil
+ break unless item
+
+ # Get the item name for logging
+ item_name = name_proc.call(item)
+
+ # Process the item with retry mechanism
+ success = false
+ retry_count = 0
+
+ until success || retry_count >= @retry_count
+ begin
+ # Process the item
+ process_proc.call(item)
+ success = true
+ rescue => e
+ retry_count += 1
+
+ # Log the error and retry information
+ mutex.synchronize do
+ puts " Error processing #{item_name}: #{e.message}"
+ if retry_count < @retry_count
+ puts " Retrying in #{@retry_delay} seconds (attempt #{retry_count + 1}/#{@retry_count})..."
+ else
+ puts " Failed to process after #{@retry_count} attempts."
+ end
+ end
+
+ # Wait before retrying
+ sleep(@retry_delay)
+ end
+ end
+
+ # Update progress
+ mutex.synchronize do
+ completed += 1
+ puts " [#{completed}/#{total}] Processed: #{item_name} (#{(completed.to_f / total * 100).round(1)}%)"
+ end
+ end
+ end
+ end
+
+ # Wait for all threads to complete
+ threads.each(&:join)
+
+ return {
+ total: total,
+ completed: completed
+ }
+ end
+
+ # Get the markdown file path for a repository
+ def get_markdown_filepath(repo_full_name, date = Time.now)
+ # Create directory structure based on date: markdown/YYYY/MM/
+ year_dir = date.strftime("%Y")
+ month_dir = date.strftime("%m")
+ target_dir = File.join(config.markdown_dir, year_dir, month_dir)
+ FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+
+ # Format filename: YYYYMMDD.repo_owner.repo_name.md
+ date_str = date.strftime("%Y%m%d")
+ repo_name = repo_full_name.gsub('/', '.')
+ filename = "#{date_str}.#{repo_name}.md"
+
+ File.join(target_dir, filename)
+ end
+
+ # Get the JSON file path for a repository
+ def get_json_filepath(repo_full_name, date = Time.now)
+ # Create directory structure based on date: json/YYYY/MM/
+ year_dir = date.strftime("%Y")
+ month_dir = date.strftime("%m")
+ target_dir = File.join(config.json_dir, year_dir, month_dir)
+ FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+
+ # Format filename: YYYYMMDD.repo_owner.repo_name.json
+ date_str = date.strftime("%Y%m%d")
+ repo_name = repo_full_name.gsub('/', '.')
+ filename = "#{date_str}.#{repo_name}.json"
+
+ File.join(target_dir, filename)
+ end
+
  def get_last_repo_name
  last_repo_file = File.join(config.output_dir, LAST_REPO_FILE)
  return nil unless File.exist?(last_repo_file)
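Review note: `convert_to_markdown` in the hunk above probes for pandoc by running `pandoc --version` through `Open3.capture2e`. When the executable is missing, that call raises `Errno::ENOENT` rather than returning a failed status, so in the diff the missing-tool case is handled by the generic `rescue => e` branch. A minimal sketch of an explicit availability check, assuming a hypothetical helper named `tool_available?` that is not part of the gem:

```ruby
require "open3"

# Hypothetical helper: true when `command --version` runs and exits 0.
def tool_available?(command)
  _output, status = Open3.capture2e(command, "--version")
  status.success?
rescue Errno::ENOENT
  # Raised when the executable cannot be found on PATH.
  false
end

puts tool_available?("pandoc")
```

A check like this could be run once before starting the worker threads instead of once per README. Separately, the fallback in the diff builds its status object with `OpenStruct`, which lives in the `ostruct` standard library; the require list shown in this diff does not load it explicitly.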
@@ -133,25 +589,19 @@ module Star
  File.write(last_repo_file, repo_name)
  end
 
-
  def save_star_as_json(star)
  star_data = star.to_hash
 
  # Get starred_at date or use current date as fallback
  starred_at = star.respond_to?(:starred_at) ? Time.parse(star.starred_at) : Time.now
 
- # Create directory structure based on starred_at date: json/YYYY/MM/
- year_dir = starred_at.strftime("%Y")
- month_dir = starred_at.strftime("%m")
- target_dir = File.join(config.json_dir, year_dir, month_dir)
- FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+ # Get the repository name
+ repo_full_name = get_repo_full_name(star)
 
- # Format filename: YYYYMMDD.username.repo_name.json
- date_str = starred_at.strftime("%Y%m%d")
- repo_name = get_repo_full_name(star).gsub('/', '.')
- filename = "#{date_str}.#{repo_name}.json"
+ # Get the JSON file path
+ filepath = get_json_filepath(repo_full_name, starred_at)
 
- filepath = File.join(target_dir, filename)
+ # Write the JSON file
  File.write(filepath, JSON.pretty_generate(star_data))
  end
 
@@ -159,19 +609,14 @@ module Star
  # Get starred_at date or use current date as fallback
  starred_at = star.respond_to?(:starred_at) ? Time.parse(star.starred_at) : Time.now
 
- # Create directory structure based on starred_at date: markdown/YYYY/MM/
- year_dir = starred_at.strftime("%Y")
- month_dir = starred_at.strftime("%m")
- target_dir = File.join(config.markdown_dir, year_dir, month_dir)
- FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
-
- # Format filename: YYYYMMDD.username.repo_name.md
- date_str = starred_at.strftime("%Y%m%d")
+ # Get the repository name
  repo_full_name = get_repo_full_name(star)
- repo_name = repo_full_name.gsub('/', '.')
- filename = "#{date_str}.#{repo_name}.md"
 
- filepath = File.join(target_dir, filename)
+ # Get the markdown file path
+ filepath = get_markdown_filepath(repo_full_name, starred_at)
+
+ # Skip if file already exists
+ return if File.exist?(filepath)
 
  # Include starred_at in the markdown
  starred_at_str = star.respond_to?(:starred_at) ? star.starred_at : "N/A"
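Review note: both save methods now delegate path construction to `get_json_filepath` and `get_markdown_filepath`, which produce the `YYYY/MM/YYYYMMDD.owner.repo.ext` layout documented in the README. A minimal sketch of that layout, assuming a hypothetical helper `star_path`; unlike the gem's helpers it only builds the string and creates no directories.

```ruby
require "time"

# Hypothetical helper mirroring the YYYY/MM/YYYYMMDD.owner.repo.ext layout.
def star_path(base_dir, repo_full_name, starred_at, ext)
  filename = "#{starred_at.strftime('%Y%m%d')}.#{repo_full_name.tr('/', '.')}.#{ext}"
  File.join(base_dir, starred_at.strftime("%Y"), starred_at.strftime("%m"), filename)
end

starred_at = Time.parse("2025-03-18T09:30:00Z")
puts star_path("stars/markdown", "octocat/hello-world", starred_at, "md")
# prints stars/markdown/2025/03/20250318.octocat.hello-world.md
```

The repository name `octocat/hello-world` and the base directory are illustrative values only.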
@@ -196,10 +641,17 @@ module Star
  #{(get_topics(star) || []).map { |topic| "- #{topic}" }.join("\n")}
  MARKDOWN
 
- # Try to fetch README.md content
- readme_content = fetch_readme(repo_full_name)
- if readme_content
- content += "\n\n## README\n\n#{readme_content}\n"
+ # Try to fetch README.md content if not skipped
+ unless @skip_readme
+ readme_result = fetch_readme(repo_full_name)
+ if readme_result && readme_result[:content]
+ content += "\n\n## README"
+ # Add format note if not markdown
+ content += "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+ content += "\n#{readme_result[:content]}\n"
+ else
+ content += "\n\n## Description\n\n#{get_description(star)}\n"
+ end
  else
  content += "\n\n## Description\n\n#{get_description(star)}\n"
  end
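Review note: since `fetch_readme` now returns a hash with `:content` and `:format` instead of a plain string, the Markdown writer branches on both keys. The sketch below isolates that string assembly so its output is easy to inspect; `readme_section` is a hypothetical helper, and the strings are copied from the added lines in this hunk.

```ruby
# Hypothetical helper: builds the trailing section of a star's Markdown file
# from a fetch_readme-style result hash, falling back to the description.
def readme_section(readme_result, description)
  unless readme_result && readme_result[:content]
    return "\n\n## Description\n\n#{description}\n"
  end

  section = +"\n\n## README"
  # Non-Markdown sources get a short format note under the heading.
  section << "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
  section << "\n#{readme_result[:content]}\n"
end

puts readme_section({ content: "Hello", format: "rst" }, "unused")
```

With these strings, a Markdown README follows the `## README` heading after a single newline, while a converted README gets the format note and a blank line first.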
@@ -297,63 +749,6 @@ module Star
  []
  end
  end
-
- # Fetch README.md content from GitHub
- def fetch_readme(repo_full_name)
- begin
- # Get README content using GitHub API
- response = github.repos.contents.get(
- user: repo_full_name.split('/').first,
- repo: repo_full_name.split('/').last,
- path: 'README.md'
- )
-
- # Decode content from Base64
- if response.content && response.encoding == 'base64'
- return Base64.decode64(response.content).force_encoding('UTF-8')
- end
- rescue Github::Error::NotFound
- # Try README.markdown if README.md not found
- begin
- response = github.repos.contents.get(
- user: repo_full_name.split('/').first,
- repo: repo_full_name.split('/').last,
- path: 'README.markdown'
- )
-
- if response.content && response.encoding == 'base64'
- return Base64.decode64(response.content).force_encoding('UTF-8')
- end
- rescue Github::Error::NotFound
- # Try readme.md (lowercase) if previous attempts failed
- begin
- response = github.repos.contents.get(
- user: repo_full_name.split('/').first,
- repo: repo_full_name.split('/').last,
- path: 'readme.md'
- )
-
- if response.content && response.encoding == 'base64'
- return Base64.decode64(response.content).force_encoding('UTF-8')
- end
- rescue Github::Error::NotFound
- # README not found
- return nil
- rescue => e
- puts "Error fetching lowercase readme.md for #{repo_full_name}: #{e.message}"
- return nil
- end
- rescue => e
- puts "Error fetching README.markdown for #{repo_full_name}: #{e.message}"
- return nil
- end
- rescue => e
- puts "Error fetching README.md for #{repo_full_name}: #{e.message}"
- return nil
- end
-
- nil
- end
  end
  end
  end
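Review note: `download_readmes` falls back to the file name (`YYYYMMDD.owner.repo.json`) when the JSON body carries no `full_name`. The version in the diff splits the remainder on every dot and keeps only the first two parts, so a repository name containing a dot is cut short. The sketch below shows one alternative, under the assumption that owner names never contain a dot; `parse_star_filename` is a hypothetical helper, not the gem's code.

```ruby
# Hypothetical helper: recover "owner/repo" and a timestamp from a file name
# of the form YYYYMMDD.owner.repo.json. Assumes the owner contains no dot,
# so everything after the second dot belongs to the repository name.
def parse_star_filename(path)
  match = File.basename(path).match(/\A(\d{4})(\d{2})(\d{2})\.([^.]+)\.(.+)\.json\z/)
  return nil unless match

  year, month, day, owner, repo = match.captures
  { full_name: "#{owner}/#{repo}", starred_at: "#{year}-#{month}-#{day}T00:00:00Z" }
end

p parse_star_filename("json/2025/03/20250318.octocat.hello.world.json")
```

File names that do not follow the pattern return nil, which matches the diff's behavior of skipping entries whose repository name cannot be determined.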
data/lib/star/dlp/version.rb CHANGED
@@ -2,6 +2,6 @@
 
  module Star
  module Dlp
- VERSION = "0.1.0"
+ VERSION = "0.1.2"
  end
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: star-dlp
  version: !ruby/object:Gem::Version
- version: 0.1.0
+ version: 0.1.2
  platform: ruby
  authors:
  - Liu Xiang
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2025-03-16 00:00:00.000000000 Z
+ date: 2025-03-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: github_api