solo-doc 0.0.3 → 0.1.0

package/README.md CHANGED
@@ -1,61 +1,84 @@
  # Solo-Doc CLI

- Solo-Doc is a powerful Node.js CLI tool designed to crawl complex documentation sites and convert them into a single, hierarchically structured Markdown file.
+ Solo-Doc is a powerful Node.js CLI tool that crawls complex documentation sites and converts them into a single Markdown file that preserves the original hierarchy.

- **Name Origin**: "Solo" represents the capability to consolidate multiple documentation pages into a "single" (solo) file, and "Doc" stands for documentation.
+ **Name origin**: "Solo" refers to consolidating multiple documentation pages into a "single" (solo) file; "Doc" stands for documentation.

- ## Features
+ ## ✨ Features

- - **Multi-Strategy Support**: Specialized strategies for different documentation frameworks:
-   - **OCP (Red Hat OpenShift)**: Optimised for static single-page HTML documentation.
-   - **ACP (Alauda Container Platform)**: Optimised for dynamic, client-side rendered (Rspress-based) documentation using Puppeteer.
- - **Hierarchy Preservation**: Maintains the original directory structure (1, 1.1, 1.1.1...) of the documentation.
- - **Clean Output**: Removes navigation bars, sidebars, headers, and footers, keeping only the relevant content.
- - **Single File Output**: Merges all crawled pages into one comprehensive Markdown file.
+ - **🧠 Smart detection**:
+   - **Automatic strategy detection**: Detects the documentation type (Red Hat OpenShift or Alauda) from the URL.
+   - **Automatic naming**: Generates the output filename from the documentation path (for example, `acp-building_application.md`), so it does not have to be specified manually.
+ - **🏗 Multi-strategy support**: Dedicated strategies for different documentation frameworks:
+   - **OCP (Red Hat OpenShift)**: Optimized for static single-page HTML documentation.
+   - **ACP (Alauda Container Platform)**: Optimized for dynamic, client-side rendered (Rspress-based) documentation, crawled with Puppeteer.
+ - **🌲 Hierarchy preservation**: Fully preserves the documentation's original directory structure (1, 1.1, 1.1.1...).
+ - **✨ Clean output**: Removes navigation bars, sidebars, headers, and footers, keeping only the core content.
+ - **📄 Single-file output**: Merges all crawled pages into one complete Markdown file.

- ## Installation
+ ## 📦 Installation

- ### From NPM (Recommended)
+ ### Install from NPM (recommended)

- Once published, you can install the tool globally:
+ You can install the tool globally:

  ```bash
  npm install -g solo-doc
  ```

- ## Usage
+ ## 🚀 Usage

- Once installed globally, you can run the `solo-doc` command from any terminal window.
+ After a global install, you can run the `solo-doc` command from any terminal window.

- ### 1. Crawl OpenShift (OCP) Docs
+ ### Basic usage (auto-detection)

- For Red Hat OpenShift documentation (HTML Single format):
+ Just provide a URL. Solo-Doc detects the site type automatically and generates a meaningful filename.

  ```bash
- solo-doc ocp "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/building_applications/index" -o openshift_docs.md
+ # Crawl the Alauda docs
+ # Output file: acp-building_application.md
+ solo-doc "https://docs.alauda.io/container_platform/4.2/developer/building_application/index.html"
+
+ # Crawl the Red Hat docs
+ # Output file: ocp-building_applications.md
+ solo-doc "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/building_applications/index"
  ```

- ### 2. Crawl Alauda (ACP) Docs
+ ### 📝 Custom output filename

- For Alauda Container Platform documentation (Rspress format):
+ Use the `-o` option to specify a custom output path.

  ```bash
- solo-doc acp "https://docs.alauda.io/container_platform/4.2/developer/building_application/index.html" -o alauda_docs.md
+ solo-doc "https://docs.alauda.io/..." -o my-manual.md
  ```

- **Options:**
- - `-o, --output <path>`: Specify output file path (default: `[strategy]-docs.md`).
- - `--limit <number>`: Limit the number of pages to crawl (useful for testing).
- - `--no-headless`: Run browser in visible mode (for ACP debugging).
+ ### 🔧 Force a strategy type
+
+ If the URL cannot be detected automatically (for example, a private IP or a custom domain), you can force a strategy with `--type`.
+
+ ```bash
+ # Force the ACP strategy for a private deployment
+ solo-doc "http://10.1.2.3/docs/index.html" --type acp
+ ```
+
+ ## ⚙️ Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `<url>` | Documentation URL to crawl | (required) |
+ | `-o, --output <path>` | Output file path. If omitted, the filename is generated from the URL. | `[type]-[path-segment].md` |
+ | `-t, --type <type>` | Force the strategy type (`ocp` or `acp`). | Auto-detected |
+ | `--limit <number>` | Limit the number of pages to crawl (for testing/debugging). | Unlimited |
+ | `--no-headless` | Run the browser in visible mode (ACP only, for debugging). | Headless |

- ## Requirements
+ ## ✅ Requirements

- - Node.js >= 18
- - Google Chrome (for ACP crawling)
+ - Node.js >= 20
+ - Google Chrome (for ACP crawling)

- ## Development
+ ## 💻 Development

  ```bash
- # Run in development mode
- npm run dev -- acp "url" ...
+ # Run in development mode
+ npm run dev -- "https://docs.alauda.io/..."
  ```
@@ -8,37 +8,74 @@ const commander_1 = require("commander");
  const CrawlerContext_1 = require("../src/CrawlerContext");
  const OCPStrategy_1 = require("../src/strategies/OCPStrategy");
  const ACPStrategy_1 = require("../src/strategies/ACPStrategy");
+ const StrategyDetector_1 = require("../src/utils/StrategyDetector");
+ const chalk_1 = __importDefault(require("chalk"));
  const path_1 = __importDefault(require("path"));
  const program = new commander_1.Command();
  program
      .name('solo-doc')
      .description('CLI to crawl documentation sites and convert to single Markdown file')
-     .version('1.0.0');
- program
-     .command('ocp <url>')
-     .description('Crawl Red Hat OpenShift documentation')
-     .option('-o, --output <path>', 'Output file path', 'ocp-docs.md')
-     .option('--limit <number>', 'Limit number of pages (for debug)', parseInt)
-     .action(async (url, options) => {
-         const strategy = new OCPStrategy_1.OCPStrategy();
-         const context = new CrawlerContext_1.CrawlerContext(strategy);
-         const outputPath = path_1.default.resolve(process.cwd(), options.output);
-         await context.run(url, { output: outputPath, limit: options.limit });
-     });
- program
-     .command('acp <url>')
-     .description('Crawl Alauda Container Platform documentation')
-     .option('-o, --output <path>', 'Output file path', 'acp-docs.md')
+     .version('1.0.0')
+     .argument('<url>', 'The documentation URL to crawl')
+     .option('-t, --type <type>', 'Force specify strategy type (ocp, acp)')
+     .option('-o, --output <path>', 'Output file path')
      .option('--limit <number>', 'Limit number of pages (for debug)', parseInt)
-     .option('--no-headless', 'Run in headful mode (show browser)')
+     .option('--no-headless', 'Run in headful mode (show browser) - Only for ACP')
      .action(async (url, options) => {
-         const strategy = new ACPStrategy_1.ACPStrategy();
-         const context = new CrawlerContext_1.CrawlerContext(strategy);
-         const outputPath = path_1.default.resolve(process.cwd(), options.output);
-         await context.run(url, {
-             output: outputPath,
-             limit: options.limit,
-             headless: options.headless
-         });
+         try {
+             // 1. Determine Strategy
+             let type = options.type;
+             if (!type) {
+                 const detected = StrategyDetector_1.StrategyDetector.detect(url);
+                 if (detected !== StrategyDetector_1.StrategyType.UNKNOWN) {
+                     type = detected;
+                     console.log(chalk_1.default.blue(`[Solo-Doc] Auto-detected strategy: ${type.toUpperCase()}`));
+                 }
+             }
+             if (!type || (type !== 'ocp' && type !== 'acp')) {
+                 console.error(chalk_1.default.red('Error: Could not detect documentation type.'));
+                 console.error(chalk_1.default.yellow('Please use --type <ocp|acp> to specify the documentation type manually.'));
+                 process.exit(1);
+             }
+             // 2. Instantiate Strategy
+             let strategy;
+             // Helper function to generate default filename from URL
+             const generateDefaultFilename = (urlStr, typePrefix) => {
+                 try {
+                     const u = new URL(urlStr);
+                     // Get the last path segment that isn't 'index.html' or 'index' or empty
+                     const segments = u.pathname.split('/').filter(s => s && s !== 'index.html' && s !== 'index');
+                     const lastSegment = segments.length > 0 ? segments[segments.length - 1] : 'docs';
+                     // Sanitize filename
+                     const safeName = lastSegment.replace(/[^a-zA-Z0-9-_]/g, '_');
+                     return `${typePrefix}-${safeName}.md`;
+                 }
+                 catch (e) {
+                     return `${typePrefix}-docs.md`;
+                 }
+             };
+             let defaultOutput;
+             if (type === 'ocp' || type === StrategyDetector_1.StrategyType.OCP) {
+                 strategy = new OCPStrategy_1.OCPStrategy();
+                 defaultOutput = generateDefaultFilename(url, 'ocp');
+             }
+             else {
+                 strategy = new ACPStrategy_1.ACPStrategy();
+                 defaultOutput = generateDefaultFilename(url, 'acp');
+             }
+             // 3. Prepare Context
+             const context = new CrawlerContext_1.CrawlerContext(strategy);
+             const outputPath = path_1.default.resolve(process.cwd(), options.output || defaultOutput);
+             // 4. Run
+             await context.run(url, {
+                 output: outputPath,
+                 limit: options.limit,
+                 headless: options.headless
+             });
+         }
+         catch (error) {
+             console.error(chalk_1.default.red(`[Solo-Doc] Failed: ${error.message}`));
+             process.exit(1);
+         }
      });
  program.parse(process.argv);
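
Review note: the default-filename logic added in this hunk is easier to follow when exercised in isolation. The sketch below copies the `generateDefaultFilename` helper out of the action callback into a standalone function (the extraction is only for illustration; in the package it remains a closure inside `.action`). The sample URLs other than the README's Alauda example are made up.

```javascript
// Standalone copy of the helper from the hunk above, for experimentation only.
function generateDefaultFilename(urlStr, typePrefix) {
  try {
    const u = new URL(urlStr);
    // Keep only non-empty path segments that are not index pages.
    const segments = u.pathname
      .split('/')
      .filter((s) => s && s !== 'index.html' && s !== 'index');
    const lastSegment = segments.length > 0 ? segments[segments.length - 1] : 'docs';
    // Replace anything outside [a-zA-Z0-9-_] so the name is filesystem-safe.
    const safeName = lastSegment.replace(/[^a-zA-Z0-9-_]/g, '_');
    return `${typePrefix}-${safeName}.md`;
  } catch (e) {
    // new URL() throws on input it cannot parse; fall back to a generic name.
    return `${typePrefix}-docs.md`;
  }
}

console.log(generateDefaultFilename(
  'https://docs.alauda.io/container_platform/4.2/developer/building_application/index.html',
  'acp'
)); // acp-building_application.md
console.log(generateDefaultFilename('https://example.com/docs/v4.2/', 'ocp')); // ocp-v4_2.md
console.log(generateDefaultFilename('not a url', 'acp')); // acp-docs.md
```

Note that a dot in the last path segment is rewritten to an underscore, so a URL ending in a version directory yields a name such as `ocp-v4_2.md`.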
@@ -47,6 +47,23 @@ class OCPStrategy {
          this.name = 'OCP (Red Hat OpenShift)';
      }
      async execute(url, options) {
+         // Optimisation: Try to convert multi-page URL (/html/) to single-page URL (/html-single/)
+         // Example: .../html/building_applications/index -> .../html-single/building_applications/index
+         if (url.includes('/html/') && !url.includes('/html-single/')) {
+             const singlePageUrl = url.replace('/html/', '/html-single/');
+             console.log(chalk_1.default.blue(`[OCP] Detected multi-page URL. Attempting to switch to single-page version for better results...`));
+             console.log(chalk_1.default.gray(`Original: ${url}`));
+             console.log(chalk_1.default.cyan(`Optimized: ${singlePageUrl}`));
+             try {
+                 // Verify if the single page exists
+                 await axios_1.default.head(singlePageUrl);
+                 url = singlePageUrl;
+                 console.log(chalk_1.default.green(`[OCP] Successfully switched to single-page version.`));
+             }
+             catch (e) {
+                 console.log(chalk_1.default.yellow(`[OCP] Single-page version not found. Falling back to original URL.`));
+             }
+         }
          const spinner = (0, ora_1.default)('Fetching OCP content...').start();
          try {
              // 1. Fetch the single page HTML
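
Review note: the URL rewrite in this hunk can be checked without any network access. The sketch below isolates only the string transformation; `toSinglePageUrl` is a hypothetical name introduced here, and the `axios.head` existence check from the hunk is deliberately left out.

```javascript
// Hypothetical helper isolating the /html/ -> /html-single/ rewrite from the hunk above.
function toSinglePageUrl(url) {
  if (url.includes('/html/') && !url.includes('/html-single/')) {
    // With a string pattern, String.prototype.replace changes only the first match.
    return url.replace('/html/', '/html-single/');
  }
  return url; // already single-page, or not a multi-page docs URL
}

const multiPage =
  'https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/building_applications/index';
console.log(toSinglePageUrl(multiPage));
// https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/building_applications/index
```

In the actual strategy the rewritten URL is only adopted if the `HEAD` request succeeds; otherwise the original URL is kept.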
@@ -0,0 +1,32 @@
+ "use strict";
+ Object.defineProperty(exports, "__esModule", { value: true });
+ exports.StrategyDetector = exports.StrategyType = void 0;
+ var StrategyType;
+ (function (StrategyType) {
+     StrategyType["OCP"] = "ocp";
+     StrategyType["ACP"] = "acp";
+     StrategyType["UNKNOWN"] = "unknown";
+ })(StrategyType || (exports.StrategyType = StrategyType = {}));
+ class StrategyDetector {
+     static detect(url) {
+         try {
+             // Add protocol if missing to ensure URL parsing works
+             if (!url.startsWith('http://') && !url.startsWith('https://')) {
+                 url = 'https://' + url;
+             }
+             const urlObj = new URL(url);
+             const hostname = urlObj.hostname;
+             if (hostname.includes('redhat.com') || hostname.includes('openshift.com')) {
+                 return StrategyType.OCP;
+             }
+             if (hostname.includes('alauda.io') || hostname.includes('alauda.cn')) {
+                 return StrategyType.ACP;
+             }
+             return StrategyType.UNKNOWN;
+         }
+         catch (e) {
+             return StrategyType.UNKNOWN;
+         }
+     }
+ }
+ exports.StrategyDetector = StrategyDetector;
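
Review note: `detect` matches with `hostname.includes(...)`, so any hostname that merely contains one of the strings (for example `redhat.com.example.org`) is also classified as OCP. If that matters, one possible alternative is a domain-suffix check. The sketch below is an assumption about how that could look, not part of the package; `matchesDomain` and `detectStrict` are hypothetical names.

```javascript
// Hypothetical stricter variant of the detector above: match the exact domain
// or a subdomain of it, instead of any substring of the hostname.
function matchesDomain(hostname, domain) {
  return hostname === domain || hostname.endsWith('.' + domain);
}

function detectStrict(url) {
  try {
    // Same protocol defaulting as the original detect().
    if (!url.startsWith('http://') && !url.startsWith('https://')) {
      url = 'https://' + url;
    }
    const { hostname } = new URL(url);
    if (['redhat.com', 'openshift.com'].some((d) => matchesDomain(hostname, d))) {
      return 'ocp';
    }
    if (['alauda.io', 'alauda.cn'].some((d) => matchesDomain(hostname, d))) {
      return 'acp';
    }
    return 'unknown';
  } catch (e) {
    return 'unknown';
  }
}

console.log(detectStrict('docs.redhat.com/en/documentation')); // ocp
console.log(detectStrict('https://docs.alauda.io/container_platform/4.2/')); // acp
console.log(detectStrict('http://10.1.2.3/docs/index.html')); // unknown
console.log(detectStrict('https://redhat.com.example.org/')); // unknown
```

Private deployments still resolve to `unknown` under either variant, which is the case the new `--type` option covers.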
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "solo-doc",
-   "version": "0.0.3",
+   "version": "0.1.0",
    "main": "dist/bin/solo-doc.js",
    "bin": {
      "solo-doc": "dist/bin/solo-doc.js"