star-dlp 0.1.0 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/README.md +42 -2
- data/README_zh.md +42 -2
- data/lib/star/dlp/cli.rb +40 -1
- data/lib/star/dlp/downloader.rb +486 -91
- data/lib/star/dlp/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 49aecf46afd8779a951f317d8412ae41d157bcb50d6df163db0eec69f556881b
+  data.tar.gz: 4f3b4809beb3fddc5508f2f2e55cd62012829ff6f6b142b28034eee4e2aaedc0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 824c986da6d7c0e30f058bec67b254d289024da4f7effdfbf1975af7d2a5414671e551d561df110331d5cdcff2a3c8f3427028600ade35d86fe3958264df370a
+  data.tar.gz: 18278facb4fd629b173af6f78f9a3c0b10c8ec2cd035590031f4a7fbe30d9a0d016d39fc2c6c2cfc3d285a5b52ac2cd41f6ea250a65a5e0e2589e3b54e213c1a
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -48,6 +48,44 @@ $ star-dlp download your_github_username

 This will download all your starred repositories and save them as JSON and Markdown files. If you've previously downloaded some repositories, it will only download newly starred repositories.

+Available options:
+- `--token`: GitHub API token
+- `--output_dir`: Output directory
+- `--json_dir`: JSON files directory
+- `--markdown_dir`: Markdown files directory
+- `--threads`: Number of download threads (default: 16)
+- `--skip_readme`: Skip downloading README files
+- `--retry_count`: Number of retry attempts for failed downloads (default: 5)
+- `--retry_delay`: Delay in seconds between retry attempts (default: 1)
+
+Example with options:
+
+```bash
+$ star-dlp download your_github_username --threads=8 --skip_readme --retry_count=3
+```
+
+### Downloading READMEs
+
+If you've already downloaded your starred repositories but want to download or update their README files separately:
+
+```bash
+$ star-dlp download_readme
+```
+
+This command will scan your JSON files directory, extract repository information, and download README files for repositories that don't already have them.
+
+Available options:
+- `--threads`: Number of download threads (default: 16)
+- `--retry_count`: Number of retry attempts for failed downloads (default: 5)
+- `--retry_delay`: Delay in seconds between retry attempts (default: 1)
+- `--force`: Force download even if README was already downloaded
+
+Example with options:
+
+```bash
+$ star-dlp download_readme --threads=8 --force
+```
+
 ### View Version

 ```bash
@@ -60,8 +98,10 @@ Star-DLP saves files in the following locations:

 - Configuration file: `~/.star-dlp/config.json`
 - Starred repositories: `~/.star-dlp/stars/`
-- JSON files: `~/.star-dlp/stars/json
-- Markdown files: `~/.star-dlp/stars/markdown
+- JSON files: `~/.star-dlp/stars/json/YYYY/MM/YYYYMMDD.owner.repo.json`
+- Markdown files: `~/.star-dlp/stars/markdown/YYYY/MM/YYYYMMDD.owner.repo.md`
+- Last downloaded repository: `~/.star-dlp/stars/last_downloaded_repo.txt`
+- Downloaded READMEs list: `~/.star-dlp/stars/downloaded_readmes.txt`

 ## Development

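The new options complement each other: a bulk `download` can skip READMEs to finish quickly, and `download_readme` can backfill them later from the saved JSON files. An illustrative two-pass workflow using only the flags documented in the README diff above:

```bash
# First pass: fetch all starred repos quickly, without READMEs
$ star-dlp download your_github_username --threads=8 --skip_readme

# Second pass: backfill READMEs from the saved JSON files, retrying failures
$ star-dlp download_readme --threads=8 --retry_count=3
```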
data/README_zh.md
CHANGED
@@ -48,6 +48,44 @@ $ star-dlp download your_github_username

 This will download all of your starred repositories and save them as JSON and Markdown files. If you have previously downloaded some repositories, only newly starred repositories will be downloaded.

+Available options:
+- `--token`: GitHub API token
+- `--output_dir`: Output directory
+- `--json_dir`: JSON files directory
+- `--markdown_dir`: Markdown files directory
+- `--threads`: Number of download threads (default: 16)
+- `--skip_readme`: Skip downloading README files
+- `--retry_count`: Number of retries for failed downloads (default: 5)
+- `--retry_delay`: Delay in seconds between retries (default: 1)
+
+Example with options:
+
+```bash
+$ star-dlp download your_github_username --threads=8 --skip_readme --retry_count=3
+```
+
+### Downloading README files
+
+If you have already downloaded your starred repositories but want to download or update their README files separately:
+
+```bash
+$ star-dlp download_readme
+```
+
+This command scans your JSON files directory, extracts repository information, and downloads README files for repositories whose READMEs have not yet been downloaded.
+
+Available options:
+- `--threads`: Number of download threads (default: 16)
+- `--retry_count`: Number of retries for failed downloads (default: 5)
+- `--retry_delay`: Delay in seconds between retries (default: 1)
+- `--force`: Force download even if the README has already been downloaded
+
+Example with options:
+
+```bash
+$ star-dlp download_readme --threads=8 --force
+```
+
 ### View Version

 ```bash
@@ -60,8 +98,10 @@ Star-DLP saves files in the following locations:

 - Configuration file: `~/.star-dlp/config.json`
 - Starred repositories: `~/.star-dlp/stars/`
-- JSON files: `~/.star-dlp/stars/json
-- Markdown files: `~/.star-dlp/stars/markdown
+- JSON files: `~/.star-dlp/stars/json/YYYY/MM/YYYYMMDD.owner.repo.json`
+- Markdown files: `~/.star-dlp/stars/markdown/YYYY/MM/YYYYMMDD.owner.repo.md`
+- Last downloaded repository: `~/.star-dlp/stars/last_downloaded_repo.txt`
+- Downloaded READMEs list: `~/.star-dlp/stars/downloaded_readmes.txt`

 ## Development

data/lib/star/dlp/cli.rb
CHANGED
@@ -1,6 +1,9 @@
 # frozen_string_literal: true

 require "thor"
+require "fileutils"
+require "json"
+require "time"
 require_relative "config"
 require_relative "downloader"

@@ -12,6 +15,10 @@ module Star
       option :output_dir, type: :string, desc: "Output directory for stars"
       option :json_dir, type: :string, desc: "Directory for JSON files"
       option :markdown_dir, type: :string, desc: "Directory for Markdown files"
+      option :threads, type: :numeric, default: 16, desc: "Number of download threads"
+      option :skip_readme, type: :boolean, default: false, desc: "Skip downloading README files"
+      option :retry_count, type: :numeric, default: 5, desc: "Number of retry attempts for failed downloads"
+      option :retry_delay, type: :numeric, default: 1, desc: "Delay in seconds between retry attempts"
       def download(username)
         config = Config.load

@@ -24,10 +31,42 @@ module Star
         # Save config for future use
         config.save

-        downloader = Downloader.new(
+        downloader = Downloader.new(
+          config,
+          username,
+          thread_count: options[:threads],
+          skip_readme: options[:skip_readme],
+          retry_count: options[:retry_count],
+          retry_delay: options[:retry_delay]
+        )
         downloader.download
       end

+      desc "download_readme", "Download READMEs for all repositories from JSON files"
+      option :threads, type: :numeric, default: 16, desc: "Number of download threads"
+      option :retry_count, type: :numeric, default: 5, desc: "Number of retry attempts for failed downloads"
+      option :retry_delay, type: :numeric, default: 1, desc: "Delay in seconds between retry attempts"
+      option :force, type: :boolean, default: false, desc: "Force download even if README was already downloaded"
+      def download_readme
+        config = Config.load
+
+        # Create a downloader instance
+        downloader = Downloader.new(
+          config,
+          "readme_downloader", # Placeholder username
+          thread_count: options[:threads],
+          retry_count: options[:retry_count],
+          retry_delay: options[:retry_delay]
+        )
+
+        # Call the download_readmes method in the Downloader class
+        result = downloader.download_readmes(force: options[:force])
+
+        puts "README download completed!"
+        puts "Successfully downloaded: #{result[:success]}"
+        puts "Failed or not found: #{result[:failed]}"
+      end
+
       desc "config", "Configure star-dlp"
       option :token, type: :string, desc: "GitHub API token"
       option :output_dir, type: :string, desc: "Output directory for stars"
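Since `download_readme` is registered as a Thor task with its flags declared via `option`, the standard Thor help output should list them. A quick way to check (illustrative, assuming the `star-dlp` executable shown in the README):

```bash
$ star-dlp help download_readme   # should list --threads, --retry_count, --retry_delay and --force
```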
data/lib/star/dlp/downloader.rb
CHANGED
@@ -2,9 +2,12 @@

 require "github_api"
 require "json"
+require "tempfile"
 require "fileutils"
 require "time"
 require "base64"
+require "thread"
+require "open3"

 module Star
   module Dlp
@@ -12,10 +15,45 @@ module Star
       attr_reader :config, :github, :username

       LAST_REPO_FILE = "last_downloaded_repo.txt"
+      DOWNLOADED_READMES_FILE = "downloaded_readmes.txt"
+      DEFAULT_THREAD_COUNT = 16
+      DEFAULT_RETRY_COUNT = 5
+      DEFAULT_RETRY_DELAY = 1 # seconds

-
+      # Supported README formats in order of preference
+      README_FORMATS = [
+        "README.md",
+        "README.markdown",
+        "readme.md",
+        "README.org",
+        "README.rst",
+        "README.txt",
+        "README.rdoc",
+        "README.adoc",
+        "README",
+        "readme.org",
+        "readme.rst",
+        "readme.txt",
+        "readme.rdoc",
+        "readme.adoc",
+        "readme"
+      ]
+
+      # Formats that need conversion to markdown
+      FORMATS_NEEDING_CONVERSION = {
+        ".org" => "org",
+        ".rst" => "rst",
+        ".txt" => "txt",
+        "" => "txt" # For files without extension
+      }
+
+      def initialize(config, username, thread_count: DEFAULT_THREAD_COUNT, skip_readme: false, retry_count: DEFAULT_RETRY_COUNT, retry_delay: DEFAULT_RETRY_DELAY)
         @config = config
         @username = username
+        @thread_count = thread_count
+        @skip_readme = skip_readme
+        @retry_count = retry_count
+        @retry_delay = retry_delay

         # Initialize GitHub API client with the special Accept header for starred_at field
         options = {
@@ -98,14 +136,19 @@

         puts "Found #{new_stars.size} new starred repositories to download"

-        # Save new stars
+        # Save new stars using multiple threads
         if new_stars.any?
-          puts "Downloading new repositories:"
-
-
-
-
-
+          puts "Downloading new repositories using #{@thread_count} threads:"
+
+          # Process stars with multithreading
+          process_items_with_threads(
+            new_stars,
+            ->(star) { get_repo_full_name(star) },
+            ->(star) {
+              save_star_as_json(star)
+              save_star_as_markdown(star)
+            }
+          )

           puts "Download completed successfully!"
         else
@@ -119,8 +162,421 @@
         end
       end

+      # Download READMEs for all repositories from JSON files
+      def download_readmes(force: false)
+        puts "Downloading READMEs for repositories from JSON files"
+
+        # File to track repositories with downloaded READMEs
+        downloaded_readmes_file = File.join(config.output_dir, DOWNLOADED_READMES_FILE)
+
+        # Load list of repositories with already downloaded READMEs
+        downloaded_repos = Set.new
+        if File.exist?(downloaded_readmes_file) && !force
+          File.readlines(downloaded_readmes_file).each do |line|
+            downloaded_repos.add(line.strip)
+          end
+          puts "Found #{downloaded_repos.size} repositories with already downloaded READMEs"
+        end
+
+        # Find all JSON files in the json directory
+        json_files = Dir.glob(File.join(config.json_dir, "**", "*.json"))
+        puts "Found #{json_files.size} JSON files"
+
+        # Extract repository names from JSON files
+        repos_to_process = []
+        repo_dates = {} # Store starred_at dates for repositories
+
+        json_files.each do |json_file|
+          begin
+            data = JSON.parse(File.read(json_file))
+
+            # Extract repository full name from JSON data
+            repo_full_name = nil
+            starred_at = nil
+
+            if data.is_a?(Hash) && data["repo"] && data["repo"]["full_name"]
+              repo_full_name = data["repo"]["full_name"]
+              starred_at = data["starred_at"] if data.key?("starred_at")
+            elsif data.is_a?(Hash) && data["full_name"]
+              repo_full_name = data["full_name"]
+              starred_at = data["starred_at"] if data.key?("starred_at")
+            elsif File.basename(json_file) =~ /(\d{8})\.(.+)\.json$/
+              # Try to extract from filename (format: YYYYMMDD.owner.repo.json)
+              date_str = $1
+              parts = $2.split('.')
+              if parts.size >= 2
+                repo_full_name = "#{parts[0]}/#{parts[1]}"
+                # Convert YYYYMMDD to ISO date format
+                if date_str =~ /^(\d{4})(\d{2})(\d{2})$/
+                  starred_at = "#{$1}-#{$2}-#{$3}T00:00:00Z"
+                end
+              end
+            end
+
+            # Skip if we couldn't determine the repository name or if README was already downloaded
+            next if repo_full_name.nil?
+            next if downloaded_repos.include?(repo_full_name) && !force
+
+            repos_to_process << repo_full_name
+            # Store the starred_at date if available
+            repo_dates[repo_full_name] = starred_at if starred_at
+          rescue JSON::ParserError => e
+            puts "Error parsing JSON file #{json_file}: #{e.message}"
+          end
+        end
+
+        puts "Found #{repos_to_process.size} repositories that need README downloads"
+
+        # Create a mutex for thread-safe file writing
+        mutex = Mutex.new
+        success_count = 0
+        failed_count = 0
+
+        # Process repositories with multithreading
+        result = process_items_with_threads(
+          repos_to_process,
+          ->(repo) { repo }, # Item name is the repo name itself
+          ->(repo_full_name) {
+            # Try to download README
+            readme_result = fetch_readme(repo_full_name)
+
+            if readme_result && readme_result[:content]
+              # Get starred_at date if available, or use current date as fallback
+              date = nil
+              if repo_dates.key?(repo_full_name) && repo_dates[repo_full_name]
+                begin
+                  date = Time.parse(repo_dates[repo_full_name])
+                rescue
+                  date = Time.now
+                end
+              else
+                date = Time.now
+              end
+
+              # Create markdown file path
+              md_filepath = get_markdown_filepath(repo_full_name, date)
+
+              mutex.synchronize do
+                # Check if file exists
+                if File.exist?(md_filepath)
+                  # Append README content to existing file
+                  File.open(md_filepath, 'a') do |file|
+                    file.puts "\n\n## README"
+                    file.puts "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+                    file.puts "\n#{readme_result[:content]}\n"
+                  end
+                else
+                  # Create new file with repository information and README
+                  content = <<~MARKDOWN
+                    # #{repo_full_name}
+
+                    - **Downloaded at**: #{Time.now.iso8601}
+                    - **Starred at**: #{date.iso8601}
+
+                    [View on GitHub](https://github.com/#{repo_full_name})
+
+                    ## README
+                  MARKDOWN
+
+                  # Add format note if not markdown
+                  content += "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+
+                  # Add README content
+                  content += "\n#{readme_result[:content]}\n"
+
+                  File.write(md_filepath, content)
+                end
+
+                # Add to downloaded repositories list
+                File.open(downloaded_readmes_file, 'a') do |file|
+                  file.puts repo_full_name
+                end
+
+                success_count += 1
+              end
+
+              true
+            else
+              mutex.synchronize do
+                puts "No README found for #{repo_full_name}"
+                failed_count += 1
+              end
+              true # Mark as success even if README not found to avoid retries
+            end
+          }
+        )
+
+        puts "README download completed!"
+        puts "Successfully downloaded: #{success_count}"
+        puts "Failed or not found: #{failed_count}"
+
+        return {
+          total: repos_to_process.size,
+          success: success_count,
+          failed: failed_count
+        }
+      end
+
+      # Fetch README content from GitHub
+      # Returns a hash with :content and :format keys, or nil if not found
+      def fetch_readme(repo_full_name)
+        # Try each README format in order
+        README_FORMATS.each do |readme_path|
+          begin
+            # Get README content using GitHub API
+            response = github.repos.contents.get(
+              user: repo_full_name.split('/').first,
+              repo: repo_full_name.split('/').last,
+              path: readme_path
+            )
+
+            # Decode content from Base64
+            if response.content && response.encoding == 'base64'
+              content = Base64.decode64(response.content).force_encoding('UTF-8')
+
+              # Get file extension
+              ext = File.extname(readme_path).downcase
+
+              # Check if we need to convert the content
+              if FORMATS_NEEDING_CONVERSION.key?(ext)
+                format = FORMATS_NEEDING_CONVERSION[ext]
+                puts "Converting #{readme_path} from #{format} to markdown for #{repo_full_name}"
+
+                # Create a temporary file with the content
+                temp_file = Tempfile.new(['readme', ".#{format}"])
+                begin
+                  temp_file.write(content)
+                  temp_file.close
+
+                  # Use pandoc to convert to markdown
+                  markdown_content, status = convert_to_markdown(temp_file.path, format)
+
+                  if status.success?
+                    return { content: markdown_content, format: format }
+                  else
+                    puts "Pandoc conversion failed for #{repo_full_name}, using original content"
+                    return { content: content, format: format }
+                  end
+                ensure
+                  temp_file.unlink
+                end
+              else
+                # Already markdown, no conversion needed
+                return { content: content, format: "markdown" }
+              end
+            end
+          rescue Github::Error::NotFound
+            # Try next format
+            next
+          rescue => e
+            puts "Error fetching #{readme_path} for #{repo_full_name}: #{e.message}"
+            next
+          end
+        end
+
+        # No README found in predefined formats, check for any readme-like file in the root directory
+        begin
+          # Get repository contents
+          contents = github.repos.contents.get(
+            user: repo_full_name.split('/').first,
+            repo: repo_full_name.split('/').last,
+            path: "" # Root directory
+          )
+
+          # Look for any file with name matching /readme/i
+          readme_file = contents.find { |item| item.type == "file" && item.name =~ /readme/i }
+
+          if readme_file
+            puts "Found alternative README file: #{readme_file.name} for #{repo_full_name}"
+
+            # Get README content
+            readme_content = github.repos.contents.get(
+              user: repo_full_name.split('/').first,
+              repo: repo_full_name.split('/').last,
+              path: readme_file.name
+            )
+
+            # Decode content from Base64
+            if readme_content.content && readme_content.encoding == 'base64'
+              content = Base64.decode64(readme_content.content).force_encoding('UTF-8')
+
+              # Get file extension
+              ext = File.extname(readme_file.name).downcase
+
+              # Check if we need to convert the content
+              if FORMATS_NEEDING_CONVERSION.key?(ext)
+                format = FORMATS_NEEDING_CONVERSION[ext]
+                puts "Converting #{readme_file.name} from #{format} to markdown for #{repo_full_name}"
+
+                # Create a temporary file with the content
+                temp_file = Tempfile.new(['readme', ".#{format}"])
+                begin
+                  temp_file.write(content)
+                  temp_file.close
+
+                  # Use pandoc to convert to markdown
+                  markdown_content, status = convert_to_markdown(temp_file.path, format)
+
+                  if status.success?
+                    return { content: markdown_content, format: format }
+                  else
+                    puts "Pandoc conversion failed for #{repo_full_name}, using original content"
+                    return { content: content, format: format }
+                  end
+                ensure
+                  temp_file.unlink
+                end
+              else
+                # Determine format based on extension or default to txt
+                format = ext.empty? ? "txt" : ext[1..]
+                # Use markdown format if extension suggests it's already markdown
+                format = "markdown" if [".md", ".markdown"].include?(ext)
+
+                return { content: content, format: format }
+              end
+            end
+          end
+        rescue => e
+          puts "Error checking root directory for README-like files for #{repo_full_name}: #{e.message}"
+        end
+
+        # No README found in any format
+        nil
+      end
+
+      # Convert content from a given format to markdown using pandoc
+      def convert_to_markdown(file_path, format)
+        begin
+          # Check if pandoc is installed
+          version_output, status = Open3.capture2e("pandoc --version")
+          unless status.success?
+            puts "Warning: pandoc is not installed or not in PATH. Cannot convert non-markdown formats."
+            return [File.read(file_path), status]
+          end
+
+          # Use pandoc to convert to markdown
+          output, status = Open3.capture2e("pandoc", "-f", format, "-t", "markdown", file_path)
+
+          if status.success?
+            return [output, status]
+          else
+            puts "Pandoc conversion failed: #{output}"
+            return [File.read(file_path), status]
+          end
+        rescue => e
+          puts "Error during conversion: #{e.message}"
+          return [File.read(file_path), OpenStruct.new(success?: false)]
+        end
+      end
+
       private

+      # Process a list of items using multiple threads
+      # items: Array of items to process
+      # name_proc: Proc to get item name for logging
+      # process_proc: Proc to process each item
+      def process_items_with_threads(items, name_proc, process_proc)
+        return if items.empty?
+
+        # Create a thread-safe queue for the items
+        queue = Queue.new
+        items.each { |item| queue << item }
+
+        # Create a mutex for thread-safe output
+        mutex = Mutex.new
+
+        # Create a progress counter
+        total = items.size
+        completed = 0
+
+        # Create and start the worker threads
+        threads = Array.new(@thread_count) do
+          Thread.new do
+            until queue.empty?
+              # Try to get an item from the queue (non-blocking)
+              item = queue.pop(true) rescue nil
+              break unless item
+
+              # Get the item name for logging
+              item_name = name_proc.call(item)
+
+              # Process the item with retry mechanism
+              success = false
+              retry_count = 0
+
+              until success || retry_count >= @retry_count
+                begin
+                  # Process the item
+                  process_proc.call(item)
+                  success = true
+                rescue => e
+                  retry_count += 1
+
+                  # Log the error and retry information
+                  mutex.synchronize do
+                    puts "  Error processing #{item_name}: #{e.message}"
+                    if retry_count < @retry_count
+                      puts "  Retrying in #{@retry_delay} seconds (attempt #{retry_count + 1}/#{@retry_count})..."
+                    else
+                      puts "  Failed to process after #{@retry_count} attempts."
+                    end
+                  end
+
+                  # Wait before retrying
+                  sleep(@retry_delay)
+                end
+              end
+
+              # Update progress
+              mutex.synchronize do
+                completed += 1
+                puts "  [#{completed}/#{total}] Processed: #{item_name} (#{(completed.to_f / total * 100).round(1)}%)"
+              end
+            end
+          end
+        end
+
+        # Wait for all threads to complete
+        threads.each(&:join)
+
+        return {
+          total: total,
+          completed: completed
+        }
+      end
+
+      # Get the markdown file path for a repository
+      def get_markdown_filepath(repo_full_name, date = Time.now)
+        # Create directory structure based on date: markdown/YYYY/MM/
+        year_dir = date.strftime("%Y")
+        month_dir = date.strftime("%m")
+        target_dir = File.join(config.markdown_dir, year_dir, month_dir)
+        FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+
+        # Format filename: YYYYMMDD.repo_owner.repo_name.md
+        date_str = date.strftime("%Y%m%d")
+        repo_name = repo_full_name.gsub('/', '.')
+        filename = "#{date_str}.#{repo_name}.md"
+
+        File.join(target_dir, filename)
+      end
+
+      # Get the JSON file path for a repository
+      def get_json_filepath(repo_full_name, date = Time.now)
+        # Create directory structure based on date: json/YYYY/MM/
+        year_dir = date.strftime("%Y")
+        month_dir = date.strftime("%m")
+        target_dir = File.join(config.json_dir, year_dir, month_dir)
+        FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+
+        # Format filename: YYYYMMDD.repo_owner.repo_name.json
+        date_str = date.strftime("%Y%m%d")
+        repo_name = repo_full_name.gsub('/', '.')
+        filename = "#{date_str}.#{repo_name}.json"
+
+        File.join(target_dir, filename)
+      end
+
       def get_last_repo_name
         last_repo_file = File.join(config.output_dir, LAST_REPO_FILE)
         return nil unless File.exist?(last_repo_file)
@@ -133,25 +589,19 @@
         File.write(last_repo_file, repo_name)
       end

-
       def save_star_as_json(star)
         star_data = star.to_hash

         # Get starred_at date or use current date as fallback
         starred_at = star.respond_to?(:starred_at) ? Time.parse(star.starred_at) : Time.now

-        #
-
-        month_dir = starred_at.strftime("%m")
-        target_dir = File.join(config.json_dir, year_dir, month_dir)
-        FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
+        # Get the repository name
+        repo_full_name = get_repo_full_name(star)

-        #
-
-        repo_name = get_repo_full_name(star).gsub('/', '.')
-        filename = "#{date_str}.#{repo_name}.json"
+        # Get the JSON file path
+        filepath = get_json_filepath(repo_full_name, starred_at)

-
+        # Write the JSON file
         File.write(filepath, JSON.pretty_generate(star_data))
       end

@@ -159,19 +609,14 @@
         # Get starred_at date or use current date as fallback
         starred_at = star.respond_to?(:starred_at) ? Time.parse(star.starred_at) : Time.now

-        #
-        year_dir = starred_at.strftime("%Y")
-        month_dir = starred_at.strftime("%m")
-        target_dir = File.join(config.markdown_dir, year_dir, month_dir)
-        FileUtils.mkdir_p(target_dir) unless Dir.exist?(target_dir)
-
-        # Format filename: YYYYMMDD.username.repo_name.md
-        date_str = starred_at.strftime("%Y%m%d")
+        # Get the repository name
         repo_full_name = get_repo_full_name(star)
-        repo_name = repo_full_name.gsub('/', '.')
-        filename = "#{date_str}.#{repo_name}.md"

-
+        # Get the markdown file path
+        filepath = get_markdown_filepath(repo_full_name, starred_at)
+
+        # Skip if file already exists
+        return if File.exist?(filepath)

         # Include starred_at in the markdown
         starred_at_str = star.respond_to?(:starred_at) ? star.starred_at : "N/A"
@@ -196,10 +641,17 @@
           #{(get_topics(star) || []).map { |topic| "- #{topic}" }.join("\n")}
         MARKDOWN

-        # Try to fetch README.md content
-
-
-
+        # Try to fetch README.md content if not skipped
+        unless @skip_readme
+          readme_result = fetch_readme(repo_full_name)
+          if readme_result && readme_result[:content]
+            content += "\n\n## README"
+            # Add format note if not markdown
+            content += "\n*Format: #{readme_result[:format]}*\n" if readme_result[:format] != "markdown"
+            content += "\n#{readme_result[:content]}\n"
+          else
+            content += "\n\n## Description\n\n#{get_description(star)}\n"
+          end
         else
           content += "\n\n## Description\n\n#{get_description(star)}\n"
         end
@@ -297,63 +749,6 @@
           []
         end
       end
-
-      # Fetch README.md content from GitHub
-      def fetch_readme(repo_full_name)
-        begin
-          # Get README content using GitHub API
-          response = github.repos.contents.get(
-            user: repo_full_name.split('/').first,
-            repo: repo_full_name.split('/').last,
-            path: 'README.md'
-          )
-
-          # Decode content from Base64
-          if response.content && response.encoding == 'base64'
-            return Base64.decode64(response.content).force_encoding('UTF-8')
-          end
-        rescue Github::Error::NotFound
-          # Try README.markdown if README.md not found
-          begin
-            response = github.repos.contents.get(
-              user: repo_full_name.split('/').first,
-              repo: repo_full_name.split('/').last,
-              path: 'README.markdown'
-            )
-
-            if response.content && response.encoding == 'base64'
-              return Base64.decode64(response.content).force_encoding('UTF-8')
-            end
-          rescue Github::Error::NotFound
-            # Try readme.md (lowercase) if previous attempts failed
-            begin
-              response = github.repos.contents.get(
-                user: repo_full_name.split('/').first,
-                repo: repo_full_name.split('/').last,
-                path: 'readme.md'
-              )
-
-              if response.content && response.encoding == 'base64'
-                return Base64.decode64(response.content).force_encoding('UTF-8')
-              end
-            rescue Github::Error::NotFound
-              # README not found
-              return nil
-            rescue => e
-              puts "Error fetching lowercase readme.md for #{repo_full_name}: #{e.message}"
-              return nil
-            end
-          rescue => e
-            puts "Error fetching README.markdown for #{repo_full_name}: #{e.message}"
-            return nil
-          end
-        rescue => e
-          puts "Error fetching README.md for #{repo_full_name}: #{e.message}"
-          return nil
-        end
-
-        nil
-      end
     end
   end
 end
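One practical note on the new `convert_to_markdown` helper: it shells out to pandoc via `Open3.capture2e`, so `.org`, `.rst`, `.txt`, and extension-less READMEs are converted only when pandoc is on the PATH; otherwise the original text is kept and tagged with a format note. A quick pre-flight check (illustrative):

```bash
$ pandoc --version   # if this fails, star-dlp falls back to the raw README text
```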
data/lib/star/dlp/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: star-dlp
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.2
 platform: ruby
 authors:
 - Liu Xiang
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2025-03-
+date: 2025-03-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: github_api