@gulibs/safe-coder 0.0.24 → 0.0.25
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +351 -15
- package/dist/documentation/checkpoint-manager.d.ts +38 -0
- package/dist/documentation/checkpoint-manager.d.ts.map +1 -0
- package/dist/documentation/checkpoint-manager.js +101 -0
- package/dist/documentation/checkpoint-manager.js.map +1 -0
- package/dist/documentation/doc-crawler.d.ts +67 -1
- package/dist/documentation/doc-crawler.d.ts.map +1 -1
- package/dist/documentation/doc-crawler.js +677 -150
- package/dist/documentation/doc-crawler.js.map +1 -1
- package/dist/documentation/llms-txt/detector.d.ts +31 -0
- package/dist/documentation/llms-txt/detector.d.ts.map +1 -0
- package/dist/documentation/llms-txt/detector.js +77 -0
- package/dist/documentation/llms-txt/detector.js.map +1 -0
- package/dist/documentation/llms-txt/downloader.d.ts +30 -0
- package/dist/documentation/llms-txt/downloader.d.ts.map +1 -0
- package/dist/documentation/llms-txt/downloader.js +84 -0
- package/dist/documentation/llms-txt/downloader.js.map +1 -0
- package/dist/documentation/llms-txt/index.d.ts +4 -0
- package/dist/documentation/llms-txt/index.d.ts.map +1 -0
- package/dist/documentation/llms-txt/index.js +4 -0
- package/dist/documentation/llms-txt/index.js.map +1 -0
- package/dist/documentation/llms-txt/parser.d.ts +43 -0
- package/dist/documentation/llms-txt/parser.d.ts.map +1 -0
- package/dist/documentation/llms-txt/parser.js +177 -0
- package/dist/documentation/llms-txt/parser.js.map +1 -0
- package/dist/index.js +0 -0
- package/dist/server/mcp-server.d.ts.map +1 -1
- package/dist/server/mcp-server.js +31 -3
- package/dist/server/mcp-server.js.map +1 -1
- package/package.json +10 -11
package/README.md
CHANGED
|
@@ -18,18 +18,29 @@
|
|
|
18
18
|
- 智能缓存,基于版本键和 TTL
|
|
19
19
|
- 热门包的背景刷新
|
|
20
20
|
|
|
21
|
-
### 文档爬取和技能生成
|
|
21
|
+
### 文档爬取和技能生成 🚀
|
|
22
22
|
|
|
23
|
+
#### 核心功能
|
|
23
24
|
- 递归爬取文档网站,自动跟踪文档链接
|
|
24
25
|
- 智能边界检测,只在文档路径内爬取
|
|
25
26
|
- 生成结构化的 Agent Skill 输出(Markdown 格式)
|
|
26
27
|
- 支持深度限制、页面限制和速率控制
|
|
27
28
|
- 自动组织内容,生成目录和章节
|
|
29
|
+
|
|
30
|
+
#### 基础特性
|
|
28
31
|
- **SPA 检测**:自动检测单页应用并提供建议
|
|
29
32
|
- **智能重试**:对临时错误自动重试,提高成功率
|
|
30
33
|
- **错误分类**:详细的错误类型统计和分析
|
|
31
34
|
- **进度监控**:实时显示爬取进度和性能指标
|
|
32
35
|
|
|
36
|
+
#### 🌟 增强特性(新增)
|
|
37
|
+
- **双重爬取策略**:支持 BFS(广度优先)和 DFS(深度优先)两种策略
|
|
38
|
+
- **并行爬取**:1-10个worker并发,速度提升 **4-6倍** ⚡
|
|
39
|
+
- **断点续传**:中断后可恢复,支持大规模文档爬取
|
|
40
|
+
- **llms.txt支持**:自动检测和使用llms.txt文件,效率提升 **2-3倍** ⚡
|
|
41
|
+
- **Markdown支持**:完整支持.md文件爬取和结构化提取
|
|
42
|
+
- **智能质量检测**:6项指标多维度评估内容质量
|
|
43
|
+
|
|
33
44
|
### 网页文档浏览
|
|
34
45
|
|
|
35
46
|
- 浏览和搜索网页文档内容
|
|
@@ -435,11 +446,11 @@ pwd
|
|
|
435
446
|
在 https://expressjs.com/en/guide 中查找关于路由的内容
|
|
436
447
|
```
|
|
437
448
|
|
|
438
|
-
#### `crawl_documentation` -
|
|
449
|
+
#### `crawl_documentation` - 爬取文档并生成Agent Skill 🚀
|
|
439
450
|
|
|
440
451
|
递归爬取文档网站,提取内容并生成结构化的 Agent Skill。
|
|
441
452
|
|
|
442
|
-
|
|
453
|
+
**基础参数:**
|
|
443
454
|
- `url`(必需):文档根 URL
|
|
444
455
|
- `maxDepth`(可选):最大爬取深度(默认:3)
|
|
445
456
|
- `maxPages`(可选):最大页面数(默认:50)
|
|
@@ -451,35 +462,122 @@ pwd
|
|
|
451
462
|
- `outputDir`(可选):保存技能文件的目录(如果不提供,只返回内容)
|
|
452
463
|
- `filename`(可选):自定义文件名(不含扩展名)
|
|
453
464
|
|
|
454
|
-
|
|
465
|
+
**🌟 增强参数(新增):**
|
|
466
|
+
- `crawlStrategy`(可选):爬取策略 `'bfs'` | `'dfs'`(默认:bfs)
|
|
467
|
+
- **BFS**(广度优先):逐层爬取,适合全面覆盖
|
|
468
|
+
- **DFS**(深度优先):深入一条路径,适合深度理解
|
|
469
|
+
- `workers`(可选):并发worker数量 1-10(默认:1)
|
|
470
|
+
- 建议:小型站点2-3个,大型站点5-8个
|
|
471
|
+
- 性能提升:3 workers ≈ 3倍,5 workers ≈ 4-5倍
|
|
472
|
+
- `skipLlmsTxt`(可选):是否跳过llms.txt检测(默认:false)
|
|
473
|
+
- 启用时自动检测和使用llms.txt,效率提升2-3倍
|
|
474
|
+
- `checkpoint`(可选):断点续传配置
|
|
475
|
+
- `enabled`:是否启用断点功能
|
|
476
|
+
- `interval`:每N页保存一次(默认:10)
|
|
477
|
+
- `file`:checkpoint文件路径(可选)
|
|
478
|
+
- `resume`(可选):是否从上次断点恢复(默认:false)
|
|
479
|
+
|
|
480
|
+
**基础功能:**
|
|
455
481
|
- ✅ **SPA 检测**:自动检测单页应用(SPA)并提供建议
|
|
456
482
|
- ✅ **智能重试**:对临时错误(超时、网络错误等)自动重试
|
|
457
483
|
- ✅ **错误分类**:详细的错误类型分类和统计
|
|
458
484
|
- ✅ **进度日志**:实时显示爬取进度和统计信息
|
|
459
485
|
|
|
460
|
-
|
|
486
|
+
**🎯 使用示例:**
|
|
461
487
|
|
|
488
|
+
##### 1. 快速基础爬取
|
|
462
489
|
```
|
|
463
490
|
爬取 https://react.dev/docs 的文档并生成技能
|
|
464
491
|
```
|
|
465
492
|
|
|
493
|
+
##### 2. 并行加速爬取(推荐)
|
|
494
|
+
```
|
|
495
|
+
从 https://react.dev 生成 agent skill,使用 5 个并发worker,最多爬取 200 页
|
|
466
496
|
```
|
|
467
|
-
|
|
497
|
+
|
|
498
|
+
对应参数:
|
|
499
|
+
```json
|
|
500
|
+
{
|
|
501
|
+
"url": "https://react.dev",
|
|
502
|
+
"workers": 5,
|
|
503
|
+
"maxPages": 200,
|
|
504
|
+
"rateLimit": 200
|
|
505
|
+
}
|
|
468
506
|
```
|
|
469
507
|
|
|
470
|
-
|
|
508
|
+
##### 3. 深度优先策略
|
|
509
|
+
```
|
|
510
|
+
使用 DFS 策略爬取 https://docs.example.com,深度限制为 6
|
|
511
|
+
```
|
|
512
|
+
|
|
513
|
+
对应参数:
|
|
514
|
+
```json
|
|
515
|
+
{
|
|
516
|
+
"url": "https://docs.example.com",
|
|
517
|
+
"crawlStrategy": "dfs",
|
|
518
|
+
"maxDepth": 6,
|
|
519
|
+
"workers": 3
|
|
520
|
+
}
|
|
521
|
+
```
|
|
522
|
+
|
|
523
|
+
##### 4. 大规模爬取(带断点)
|
|
524
|
+
```
|
|
525
|
+
爬取 https://large-docs.com,最多 1000 页,启用断点每 50 页保存一次
|
|
526
|
+
```
|
|
527
|
+
|
|
528
|
+
对应参数:
|
|
529
|
+
```json
|
|
530
|
+
{
|
|
531
|
+
"url": "https://large-docs.com",
|
|
532
|
+
"maxPages": 1000,
|
|
533
|
+
"workers": 8,
|
|
534
|
+
"checkpoint": {
|
|
535
|
+
"enabled": true,
|
|
536
|
+
"interval": 50
|
|
537
|
+
}
|
|
538
|
+
}
|
|
539
|
+
```
|
|
540
|
+
|
|
541
|
+
##### 5. 恢复中断的爬取
|
|
542
|
+
```
|
|
543
|
+
从断点恢复 https://large-docs.com 的爬取
|
|
544
|
+
```
|
|
545
|
+
|
|
546
|
+
对应参数:
|
|
547
|
+
```json
|
|
548
|
+
{
|
|
549
|
+
"url": "https://large-docs.com",
|
|
550
|
+
"resume": true,
|
|
551
|
+
"checkpoint": { "enabled": true }
|
|
552
|
+
}
|
|
553
|
+
```
|
|
554
|
+
|
|
555
|
+
**📊 性能对比:**
|
|
556
|
+
|
|
557
|
+
| 爬取规模 | 串行模式 | 并行模式(workers=5) | 提升 |
|
|
558
|
+
|---------|---------|-------------------|------|
|
|
559
|
+
| 50页 | ~90秒 | ~22秒 | **4倍** ⚡ |
|
|
560
|
+
| 100页 | ~180秒 | ~45秒 | **4倍** ⚡ |
|
|
561
|
+
| 200页 | ~360秒 | ~80秒 | **4.5倍** ⚡ |
|
|
562
|
+
|
|
563
|
+
**输出内容:**
|
|
471
564
|
- `skillContent`: Markdown 格式的技能内容
|
|
472
565
|
- `metadata`: 技能元数据(标题、描述、源 URL 等)
|
|
473
|
-
- `crawlStats`:
|
|
566
|
+
- `crawlStats`: 爬取统计(总页数、最大深度、错误列表、质量指标)
|
|
474
567
|
- `files`(可选):保存的文件路径
|
|
475
|
-
- `skillFile`:
|
|
476
|
-
- `manifestFile`:
|
|
568
|
+
- `skillFile`: SKILL.md 文件路径
|
|
569
|
+
- `manifestFile`: metadata.json 文件路径
|
|
570
|
+
|
|
571
|
+
**💡 最佳实践:**
|
|
572
|
+
1. **先小规模测试**:maxPages设为10-20验证效果
|
|
573
|
+
2. **逐步增加并发**:从workers=2开始,观察效果
|
|
574
|
+
3. **大规模使用断点**:超过200页建议启用checkpoint
|
|
575
|
+
4. **关注质量指标**:diversity和coverage应 > 0.5
|
|
477
576
|
|
|
478
|
-
|
|
479
|
-
-
|
|
480
|
-
|
|
481
|
-
|
|
482
|
-
- 文件名自动从 URL 生成,或使用自定义 `filename` 参数
|
|
577
|
+
**📚 详细文档:**
|
|
578
|
+
- 完整使用指南:`docs/ENHANCED_CRAWLING.md`
|
|
579
|
+
- 快速上手:查看项目中的 `QUICKSTART_*.md` 文档
|
|
580
|
+
- 实施细节:`IMPLEMENTATION_SUMMARY.md`
|
|
483
581
|
|
|
484
582
|
#### `detect_errors` - 检测错误
|
|
485
583
|
|
|
@@ -800,6 +898,244 @@ chmod +x dist/index.js
|
|
|
800
898
|
|
|
801
899
|
MIT
|
|
802
900
|
|
|
901
|
+
## 🎓 在大模型中使用MCP Server生成Agent Skill
|
|
902
|
+
|
|
903
|
+
### 快速上手
|
|
904
|
+
|
|
905
|
+
#### 步骤1:在Claude/Cursor中触发爬取
|
|
906
|
+
|
|
907
|
+
**中文触发词:**
|
|
908
|
+
```
|
|
909
|
+
帮我爬取 https://react.dev 的文档并生成Agent Skill
|
|
910
|
+
```
|
|
911
|
+
|
|
912
|
+
```
|
|
913
|
+
使用并行爬取从 https://docs.example.com 生成技能,要快一点
|
|
914
|
+
```
|
|
915
|
+
|
|
916
|
+
```
|
|
917
|
+
递归抓取 https://nextjs.org/docs 的文档,深度优先策略
|
|
918
|
+
```
|
|
919
|
+
|
|
920
|
+
**英文触发词:**
|
|
921
|
+
```
|
|
922
|
+
Crawl https://react.dev and generate an agent skill
|
|
923
|
+
```
|
|
924
|
+
|
|
925
|
+
```
|
|
926
|
+
Use parallel crawling to create a skill from https://docs.example.com
|
|
927
|
+
```
|
|
928
|
+
|
|
929
|
+
```
|
|
930
|
+
Recursive crawl https://nextjs.org/docs with DFS strategy
|
|
931
|
+
```
|
|
932
|
+
|
|
933
|
+
#### 步骤2:配置爬取参数
|
|
934
|
+
|
|
935
|
+
Claude/Cursor会自动调用`crawl_documentation`工具,你可以在对话中指定参数:
|
|
936
|
+
|
|
937
|
+
**基础爬取:**
|
|
938
|
+
```
|
|
939
|
+
爬取 https://react.dev/docs,最多50页
|
|
940
|
+
```
|
|
941
|
+
|
|
942
|
+
**快速爬取(推荐):**
|
|
943
|
+
```
|
|
944
|
+
爬取 https://react.dev,使用5个并发worker,最多200页
|
|
945
|
+
```
|
|
946
|
+
|
|
947
|
+
**深度研究:**
|
|
948
|
+
```
|
|
949
|
+
使用DFS策略深入爬取 https://docs.example.com,深度限制为6层
|
|
950
|
+
```
|
|
951
|
+
|
|
952
|
+
**大规模爬取:**
|
|
953
|
+
```
|
|
954
|
+
爬取 https://large-docs.com,最多1000页,启用断点续传每50页保存
|
|
955
|
+
```
|
|
956
|
+
|
|
957
|
+
### 使用场景
|
|
958
|
+
|
|
959
|
+
#### 场景1:快速探索新技术栈
|
|
960
|
+
```
|
|
961
|
+
帮我爬取 Svelte 的官方文档,快速生成一个技能
|
|
962
|
+
```
|
|
963
|
+
|
|
964
|
+
**自动参数:**
|
|
965
|
+
- workers: 3(快速)
|
|
966
|
+
- maxPages: 50(快速探索)
|
|
967
|
+
- crawlStrategy: bfs(全面覆盖)
|
|
968
|
+
|
|
969
|
+
#### 场景2:深入学习框架
|
|
970
|
+
```
|
|
971
|
+
深入爬取 Vue 3 的文档,我要完整理解它的 API
|
|
972
|
+
```
|
|
973
|
+
|
|
974
|
+
**自动参数:**
|
|
975
|
+
- crawlStrategy: dfs(深入学习)
|
|
976
|
+
- maxDepth: 5(深入层级)
|
|
977
|
+
- workers: 5(加速)
|
|
978
|
+
- maxPages: 200(完整覆盖)
|
|
979
|
+
|
|
980
|
+
#### 场景3:大型文档站点
|
|
981
|
+
```
|
|
982
|
+
爬取 MDN Web Docs 的 JavaScript 部分,分多次进行,支持断点恢复
|
|
983
|
+
```
|
|
984
|
+
|
|
985
|
+
**自动参数:**
|
|
986
|
+
- workers: 8(大规模)
|
|
987
|
+
- maxPages: 1000
|
|
988
|
+
- checkpoint: { enabled: true, interval: 50 }
|
|
989
|
+
|
|
990
|
+
#### 场景4:恢复中断的爬取
|
|
991
|
+
```
|
|
992
|
+
上次爬取中断了,从断点继续爬取 https://large-docs.com
|
|
993
|
+
```
|
|
994
|
+
|
|
995
|
+
**自动参数:**
|
|
996
|
+
- resume: true
|
|
997
|
+
- checkpoint: { enabled: true }
|
|
998
|
+
|
|
999
|
+
### 生成的Agent Skill格式
|
|
1000
|
+
|
|
1001
|
+
爬取完成后,会生成标准的Claude Agent Skill格式:
|
|
1002
|
+
|
|
1003
|
+
```
|
|
1004
|
+
SKILL.md # 主技能文件
|
|
1005
|
+
├── 技能元数据(YAML frontmatter)
|
|
1006
|
+
├── When to Use This Skill # 使用场景
|
|
1007
|
+
├── Core Concepts # 核心概念
|
|
1008
|
+
├── API Reference # API参考
|
|
1009
|
+
├── Examples # 示例代码
|
|
1010
|
+
└── Reference Files # 引用文件列表
|
|
1011
|
+
|
|
1012
|
+
references/ # 引用文件目录
|
|
1013
|
+
├── page-1.md
|
|
1014
|
+
├── page-2.md
|
|
1015
|
+
└── ...
|
|
1016
|
+
```
|
|
1017
|
+
|
|
1018
|
+
### 技能质量指标
|
|
1019
|
+
|
|
1020
|
+
系统会自动评估生成的Skill质量:
|
|
1021
|
+
|
|
1022
|
+
| 指标 | 说明 | 优秀标准 |
|
|
1023
|
+
|-----|-----|---------|
|
|
1024
|
+
| **内容充足性** | 每页内容长度 | > 100字符/页 |
|
|
1025
|
+
| **结构完整性** | 标题、章节组织 | 至少1个标题/页 |
|
|
1026
|
+
| **内容多样性** | URL路径和主题分布 | diversity > 0.7 |
|
|
1027
|
+
| **API覆盖度** | 代码示例比例 | coverage > 0.5 |
|
|
1028
|
+
|
|
1029
|
+
### 使用生成的Skill
|
|
1030
|
+
|
|
1031
|
+
#### 方法1:直接在对话中使用
|
|
1032
|
+
|
|
1033
|
+
生成的Skill会自动出现在Claude的技能列表中,你可以直接引用:
|
|
1034
|
+
|
|
1035
|
+
```
|
|
1036
|
+
根据我刚生成的React技能,帮我写一个自定义Hook
|
|
1037
|
+
```
|
|
1038
|
+
|
|
1039
|
+
#### 方法2:保存为文件
|
|
1040
|
+
|
|
1041
|
+
如果指定了`outputDir`,Skill会保存为文件:
|
|
1042
|
+
|
|
1043
|
+
```
|
|
1044
|
+
爬取并保存到 ./skills 目录,文件名为 react-docs
|
|
1045
|
+
```
|
|
1046
|
+
|
|
1047
|
+
生成文件:
|
|
1048
|
+
- `./skills/react-docs.md` - Skill内容
|
|
1049
|
+
- `./skills/react-docs.metadata.json` - 元数据
|
|
1050
|
+
|
|
1051
|
+
#### 方法3:分享给团队
|
|
1052
|
+
|
|
1053
|
+
将生成的SKILL.md文件放入项目的`.claude/skills/`目录,团队成员都可使用:
|
|
1054
|
+
|
|
1055
|
+
```bash
|
|
1056
|
+
# 复制到项目技能目录
|
|
1057
|
+
cp output/react-docs.md .claude/skills/
|
|
1058
|
+
|
|
1059
|
+
# 提交到版本控制
|
|
1060
|
+
git add .claude/skills/react-docs.md
|
|
1061
|
+
git commit -m "Add React documentation skill"
|
|
1062
|
+
```
|
|
1063
|
+
|
|
1064
|
+
### 常见问题
|
|
1065
|
+
|
|
1066
|
+
#### Q: 如何选择爬取策略?
|
|
1067
|
+
**A:**
|
|
1068
|
+
- **BFS(广度优先)**:适合扁平结构、需要全面覆盖的文档
|
|
1069
|
+
- **DFS(深度优先)**:适合层级结构、需要深入理解的文档
|
|
1070
|
+
|
|
1071
|
+
#### Q: workers设置多少合适?
|
|
1072
|
+
**A:**
|
|
1073
|
+
- 小型站点(<50页):workers: 2-3
|
|
1074
|
+
- 中型站点(50-200页):workers: 3-5
|
|
1075
|
+
- 大型站点(>200页):workers: 5-8
|
|
1076
|
+
|
|
1077
|
+
#### Q: 什么时候需要启用断点?
|
|
1078
|
+
**A:**
|
|
1079
|
+
- 爬取页面超过200页
|
|
1080
|
+
- 网络不稳定
|
|
1081
|
+
- 需要分批次爬取
|
|
1082
|
+
|
|
1083
|
+
#### Q: 如何提高爬取速度?
|
|
1084
|
+
**A:**
|
|
1085
|
+
1. 增加workers数量(推荐5-8个)
|
|
1086
|
+
2. 降低rateLimit(200-300ms)
|
|
1087
|
+
3. 确保站点有llms.txt(自动检测)
|
|
1088
|
+
|
|
1089
|
+
#### Q: 生成的Skill质量不够怎么办?
|
|
1090
|
+
**A:**
|
|
1091
|
+
1. 增加maxPages(建议200+)
|
|
1092
|
+
2. 调整maxDepth(建议3-5)
|
|
1093
|
+
3. 使用includePaths精确指定路径
|
|
1094
|
+
4. 查看日志中的质量指标建议
|
|
1095
|
+
|
|
1096
|
+
### 性能提示
|
|
1097
|
+
|
|
1098
|
+
**🚀 极速模式(适合快速探索):**
|
|
1099
|
+
```json
|
|
1100
|
+
{
|
|
1101
|
+
"workers": 5,
|
|
1102
|
+
"maxPages": 50,
|
|
1103
|
+
"rateLimit": 200,
|
|
1104
|
+
"crawlStrategy": "bfs"
|
|
1105
|
+
}
|
|
1106
|
+
```
|
|
1107
|
+
|
|
1108
|
+
**🎯 深度模式(适合完整学习):**
|
|
1109
|
+
```json
|
|
1110
|
+
{
|
|
1111
|
+
"workers": 5,
|
|
1112
|
+
"maxPages": 300,
|
|
1113
|
+
"maxDepth": 5,
|
|
1114
|
+
"crawlStrategy": "dfs",
|
|
1115
|
+
"checkpoint": { "enabled": true }
|
|
1116
|
+
}
|
|
1117
|
+
```
|
|
1118
|
+
|
|
1119
|
+
**⚡ 超快模式(适合有llms.txt的站点):**
|
|
1120
|
+
```json
|
|
1121
|
+
{
|
|
1122
|
+
"workers": 8,
|
|
1123
|
+
"skipLlmsTxt": false,
|
|
1124
|
+
"rateLimit": 200
|
|
1125
|
+
}
|
|
1126
|
+
```
|
|
1127
|
+
|
|
1128
|
+
### 完整文档
|
|
1129
|
+
|
|
1130
|
+
- **快速上手指南**:项目中的快速开始文档
|
|
1131
|
+
- **详细使用说明**:`docs/ENHANCED_CRAWLING.md`
|
|
1132
|
+
- **实施技术细节**:`IMPLEMENTATION_SUMMARY.md`
|
|
1133
|
+
- **原始文档爬取指南**:`docs/DOC_CRAWLER_USAGE.md`
|
|
1134
|
+
|
|
1135
|
+
---
|
|
1136
|
+
|
|
803
1137
|
## 贡献
|
|
804
1138
|
|
|
805
1139
|
欢迎提交 Issue 和 Pull Request!
|
|
1140
|
+
|
|
1141
|
+
基于 [Skill_Seekers](examples/Skill_Seekers/) 项目的优秀实践,我们实现了完整对标并超越的TypeScript版本。
|
|
@@ -0,0 +1,38 @@
|
|
|
1
|
+
import type { CrawlOptions } from './doc-crawler.js';
|
|
2
|
+
export interface CheckpointData {
|
|
3
|
+
config: CrawlOptions;
|
|
4
|
+
visitedUrls: string[];
|
|
5
|
+
pendingUrls: Array<{
|
|
6
|
+
url: string;
|
|
7
|
+
depth: number;
|
|
8
|
+
}>;
|
|
9
|
+
pagesCrawled: number;
|
|
10
|
+
lastUpdated: string;
|
|
11
|
+
baseUrl: string;
|
|
12
|
+
}
|
|
13
|
+
/**
|
|
14
|
+
* Manager for crawl checkpoint/resume functionality
|
|
15
|
+
* Based on Skill_Seekers implementation
|
|
16
|
+
*/
|
|
17
|
+
export declare class CheckpointManager {
|
|
18
|
+
private checkpointFile;
|
|
19
|
+
constructor(checkpointFile: string);
|
|
20
|
+
/**
|
|
21
|
+
* Save checkpoint data to file
|
|
22
|
+
*/
|
|
23
|
+
saveCheckpoint(data: CheckpointData): Promise<void>;
|
|
24
|
+
/**
|
|
25
|
+
* Load checkpoint data from file
|
|
26
|
+
* Returns null if file doesn't exist or is invalid
|
|
27
|
+
*/
|
|
28
|
+
loadCheckpoint(): Promise<CheckpointData | null>;
|
|
29
|
+
/**
|
|
30
|
+
* Clear checkpoint file
|
|
31
|
+
*/
|
|
32
|
+
clearCheckpoint(): Promise<void>;
|
|
33
|
+
/**
|
|
34
|
+
* Check if checkpoint exists
|
|
35
|
+
*/
|
|
36
|
+
exists(): Promise<boolean>;
|
|
37
|
+
}
|
|
38
|
+
//# sourceMappingURL=checkpoint-manager.d.ts.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"checkpoint-manager.d.ts","sourceRoot":"","sources":["../../src/documentation/checkpoint-manager.ts"],"names":[],"mappings":"AAEA,OAAO,KAAK,EAAE,YAAY,EAAE,MAAM,kBAAkB,CAAC;AAErD,MAAM,WAAW,cAAc;IAC7B,MAAM,EAAE,YAAY,CAAC;IACrB,WAAW,EAAE,MAAM,EAAE,CAAC;IACtB,WAAW,EAAE,KAAK,CAAC;QAAE,GAAG,EAAE,MAAM,CAAC;QAAC,KAAK,EAAE,MAAM,CAAA;KAAE,CAAC,CAAC;IACnD,YAAY,EAAE,MAAM,CAAC;IACrB,WAAW,EAAE,MAAM,CAAC;IACpB,OAAO,EAAE,MAAM,CAAC;CACjB;AAED;;;GAGG;AACH,qBAAa,iBAAiB;IAC5B,OAAO,CAAC,cAAc,CAAS;gBAEnB,cAAc,EAAE,MAAM;IAIlC;;OAEG;IACG,cAAc,CAAC,IAAI,EAAE,cAAc,GAAG,OAAO,CAAC,IAAI,CAAC;IAqBzD;;;OAGG;IACG,cAAc,IAAI,OAAO,CAAC,cAAc,GAAG,IAAI,CAAC;IAiCtD;;OAEG;IACG,eAAe,IAAI,OAAO,CAAC,IAAI,CAAC;IAkBtC;;OAEG;IACG,MAAM,IAAI,OAAO,CAAC,OAAO,CAAC;CAQjC"}
|
|
@@ -0,0 +1,101 @@
|
|
|
1
|
+
import { writeFile, readFile, unlink, access } from 'fs/promises';
|
|
2
|
+
import { logger } from '../utils/logger.js';
|
|
3
|
+
/**
|
|
4
|
+
* Manager for crawl checkpoint/resume functionality
|
|
5
|
+
* Based on Skill_Seekers implementation
|
|
6
|
+
*/
|
|
7
|
+
export class CheckpointManager {
|
|
8
|
+
checkpointFile;
|
|
9
|
+
constructor(checkpointFile) {
|
|
10
|
+
this.checkpointFile = checkpointFile;
|
|
11
|
+
}
|
|
12
|
+
/**
|
|
13
|
+
* Save checkpoint data to file
|
|
14
|
+
*/
|
|
15
|
+
async saveCheckpoint(data) {
|
|
16
|
+
try {
|
|
17
|
+
const json = JSON.stringify(data, null, 2);
|
|
18
|
+
await writeFile(this.checkpointFile, json, 'utf-8');
|
|
19
|
+
logger.info('Checkpoint saved', {
|
|
20
|
+
file: this.checkpointFile,
|
|
21
|
+
pagesCrawled: data.pagesCrawled,
|
|
22
|
+
pendingUrls: data.pendingUrls.length,
|
|
23
|
+
visitedUrls: data.visitedUrls.length,
|
|
24
|
+
});
|
|
25
|
+
}
|
|
26
|
+
catch (error) {
|
|
27
|
+
const errorMessage = error instanceof Error ? error.message : String(error);
|
|
28
|
+
logger.error('Failed to save checkpoint', {
|
|
29
|
+
file: this.checkpointFile,
|
|
30
|
+
error: errorMessage,
|
|
31
|
+
});
|
|
32
|
+
throw new Error(`Failed to save checkpoint: ${errorMessage}`);
|
|
33
|
+
}
|
|
34
|
+
}
|
|
35
|
+
/**
|
|
36
|
+
* Load checkpoint data from file
|
|
37
|
+
* Returns null if file doesn't exist or is invalid
|
|
38
|
+
*/
|
|
39
|
+
async loadCheckpoint() {
|
|
40
|
+
try {
|
|
41
|
+
// Check if file exists
|
|
42
|
+
await access(this.checkpointFile);
|
|
43
|
+
// Read and parse checkpoint
|
|
44
|
+
const content = await readFile(this.checkpointFile, 'utf-8');
|
|
45
|
+
const data = JSON.parse(content);
|
|
46
|
+
logger.info('Checkpoint loaded', {
|
|
47
|
+
file: this.checkpointFile,
|
|
48
|
+
pagesCrawled: data.pagesCrawled,
|
|
49
|
+
pendingUrls: data.pendingUrls.length,
|
|
50
|
+
visitedUrls: data.visitedUrls.length,
|
|
51
|
+
lastUpdated: data.lastUpdated,
|
|
52
|
+
});
|
|
53
|
+
return data;
|
|
54
|
+
}
|
|
55
|
+
catch (error) {
|
|
56
|
+
if (error.code === 'ENOENT') {
|
|
57
|
+
logger.debug('No checkpoint file found', { file: this.checkpointFile });
|
|
58
|
+
return null;
|
|
59
|
+
}
|
|
60
|
+
const errorMessage = error instanceof Error ? error.message : String(error);
|
|
61
|
+
logger.warn('Failed to load checkpoint', {
|
|
62
|
+
file: this.checkpointFile,
|
|
63
|
+
error: errorMessage,
|
|
64
|
+
});
|
|
65
|
+
return null;
|
|
66
|
+
}
|
|
67
|
+
}
|
|
68
|
+
/**
|
|
69
|
+
* Clear checkpoint file
|
|
70
|
+
*/
|
|
71
|
+
async clearCheckpoint() {
|
|
72
|
+
try {
|
|
73
|
+
await unlink(this.checkpointFile);
|
|
74
|
+
logger.info('Checkpoint cleared', { file: this.checkpointFile });
|
|
75
|
+
}
|
|
76
|
+
catch (error) {
|
|
77
|
+
if (error.code === 'ENOENT') {
|
|
78
|
+
// File doesn't exist, that's fine
|
|
79
|
+
return;
|
|
80
|
+
}
|
|
81
|
+
const errorMessage = error instanceof Error ? error.message : String(error);
|
|
82
|
+
logger.warn('Failed to clear checkpoint', {
|
|
83
|
+
file: this.checkpointFile,
|
|
84
|
+
error: errorMessage,
|
|
85
|
+
});
|
|
86
|
+
}
|
|
87
|
+
}
|
|
88
|
+
/**
|
|
89
|
+
* Check if checkpoint exists
|
|
90
|
+
*/
|
|
91
|
+
async exists() {
|
|
92
|
+
try {
|
|
93
|
+
await access(this.checkpointFile);
|
|
94
|
+
return true;
|
|
95
|
+
}
|
|
96
|
+
catch {
|
|
97
|
+
return false;
|
|
98
|
+
}
|
|
99
|
+
}
|
|
100
|
+
}
|
|
101
|
+
//# sourceMappingURL=checkpoint-manager.js.map
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
{"version":3,"file":"checkpoint-manager.js","sourceRoot":"","sources":["../../src/documentation/checkpoint-manager.ts"],"names":[],"mappings":"AAAA,OAAO,EAAE,SAAS,EAAE,QAAQ,EAAE,MAAM,EAAE,MAAM,EAAE,MAAM,aAAa,CAAC;AAClE,OAAO,EAAE,MAAM,EAAE,MAAM,oBAAoB,CAAC;AAY5C;;;GAGG;AACH,MAAM,OAAO,iBAAiB;IACpB,cAAc,CAAS;IAE/B,YAAY,cAAsB;QAChC,IAAI,CAAC,cAAc,GAAG,cAAc,CAAC;IACvC,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,cAAc,CAAC,IAAoB;QACvC,IAAI,CAAC;YACH,MAAM,IAAI,GAAG,IAAI,CAAC,SAAS,CAAC,IAAI,EAAE,IAAI,EAAE,CAAC,CAAC,CAAC;YAC3C,MAAM,SAAS,CAAC,IAAI,CAAC,cAAc,EAAE,IAAI,EAAE,OAAO,CAAC,CAAC;YAEpD,MAAM,CAAC,IAAI,CAAC,kBAAkB,EAAE;gBAC9B,IAAI,EAAE,IAAI,CAAC,cAAc;gBACzB,YAAY,EAAE,IAAI,CAAC,YAAY;gBAC/B,WAAW,EAAE,IAAI,CAAC,WAAW,CAAC,MAAM;gBACpC,WAAW,EAAE,IAAI,CAAC,WAAW,CAAC,MAAM;aACrC,CAAC,CAAC;QACL,CAAC;QAAC,OAAO,KAAK,EAAE,CAAC;YACf,MAAM,YAAY,GAAG,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,KAAK,CAAC,CAAC;YAC5E,MAAM,CAAC,KAAK,CAAC,2BAA2B,EAAE;gBACxC,IAAI,EAAE,IAAI,CAAC,cAAc;gBACzB,KAAK,EAAE,YAAY;aACpB,CAAC,CAAC;YACH,MAAM,IAAI,KAAK,CAAC,8BAA8B,YAAY,EAAE,CAAC,CAAC;QAChE,CAAC;IACH,CAAC;IAED;;;OAGG;IACH,KAAK,CAAC,cAAc;QAClB,IAAI,CAAC;YACH,uBAAuB;YACvB,MAAM,MAAM,CAAC,IAAI,CAAC,cAAc,CAAC,CAAC;YAElC,4BAA4B;YAC5B,MAAM,OAAO,GAAG,MAAM,QAAQ,CAAC,IAAI,CAAC,cAAc,EAAE,OAAO,CAAC,CAAC;YAC7D,MAAM,IAAI,GAAG,IAAI,CAAC,KAAK,CAAC,OAAO,CAAmB,CAAC;YAEnD,MAAM,CAAC,IAAI,CAAC,mBAAmB,EAAE;gBAC/B,IAAI,EAAE,IAAI,CAAC,cAAc;gBACzB,YAAY,EAAE,IAAI,CAAC,YAAY;gBAC/B,WAAW,EAAE,IAAI,CAAC,WAAW,CAAC,MAAM;gBACpC,WAAW,EAAE,IAAI,CAAC,WAAW,CAAC,MAAM;gBACpC,WAAW,EAAE,IAAI,CAAC,WAAW;aAC9B,CAAC,CAAC;YAEH,OAAO,IAAI,CAAC;QACd,CAAC;QAAC,OAAO,KAAK,EAAE,CAAC;YACf,IAAK,KAA+B,CAAC,IAAI,KAAK,QAAQ,EAAE,CAAC;gBACvD,MAAM,CAAC,KAAK,CAAC,0BAA0B,EAAE,EAAE,IAAI,EAAE,IAAI,CAAC,cAAc,EAAE,CAAC,CAAC;gBACxE,OAAO,IAAI,CAAC;YACd,CAAC;YAED,MAAM,YAAY,GAAG,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,KAAK,CAAC,CAAC;YAC5E,MAAM,CAAC,IAAI,CAAC,2BAA2B,EAAE;gBACvC,IAAI,EAAE,IAAI,CAAC,cAAc;gBACzB,KAAK,EAAE,YAAY;aACpB,CAAC,CAAC;YACH,OAAO,IAAI,CAAC;QACd,CAAC;IACH,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,eAAe;QACnB,IAAI,CAAC;YACH,MAAM,MAAM,CAAC,IAAI,CAAC,cAAc,CAAC,CAAC;YAClC,MAAM,CAAC,IAAI,CAAC,oBAAoB,EAAE,EAAE,IAAI,EAAE,IAAI,CAAC,cAAc,EAAE,CAAC,CAAC;QACnE,CAAC;QAAC,OAAO,KAAK,EAAE,CAAC;YACf,IAAK,KAA+B,CAAC,IAAI,KAAK,QAAQ,EAAE,CAAC;gBACvD,kCAAkC;gBAClC,OAAO;YACT,CAAC;YAED,MAAM,YAAY,GAAG,KAAK,YAAY,KAAK,CAAC,CAAC,CAAC,KAAK,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,KAAK,CAAC,CAAC;YAC5E,MAAM,CAAC,IAAI,CAAC,4BAA4B,EAAE;gBACxC,IAAI,EAAE,IAAI,CAAC,cAAc;gBACzB,KAAK,EAAE,YAAY;aACpB,CAAC,CAAC;QACL,CAAC;IACH,CAAC;IAED;;OAEG;IACH,KAAK,CAAC,MAAM;QACV,IAAI,CAAC;YACH,MAAM,MAAM,CAAC,IAAI,CAAC,cAAc,CAAC,CAAC;YAClC,OAAO,IAAI,CAAC;QACd,CAAC;QAAC,MAAM,CAAC;YACP,OAAO,KAAK,CAAC;QACf,CAAC;IACH,CAAC;CACF"}
|
|
@@ -1,5 +1,6 @@
|
|
|
1
1
|
import { HttpClient } from '../utils/http-client.js';
|
|
2
2
|
export interface CrawlOptions {
|
|
3
|
+
crawlStrategy?: 'bfs' | 'dfs';
|
|
3
4
|
maxDepth?: number;
|
|
4
5
|
maxPages?: number;
|
|
5
6
|
includePaths?: string[];
|
|
@@ -8,6 +9,14 @@ export interface CrawlOptions {
|
|
|
8
9
|
maxRetries?: number;
|
|
9
10
|
retryDelay?: number;
|
|
10
11
|
useBrowserAutomation?: boolean;
|
|
12
|
+
skipLlmsTxt?: boolean;
|
|
13
|
+
workers?: number;
|
|
14
|
+
checkpoint?: {
|
|
15
|
+
enabled: boolean;
|
|
16
|
+
interval: number;
|
|
17
|
+
file?: string;
|
|
18
|
+
};
|
|
19
|
+
resume?: boolean;
|
|
11
20
|
}
|
|
12
21
|
export interface CrawledPage {
|
|
13
22
|
url: string;
|
|
@@ -69,6 +78,8 @@ export declare class DocumentationCrawler {
|
|
|
69
78
|
private options;
|
|
70
79
|
private baseUrl;
|
|
71
80
|
private linkDiscoveryStats;
|
|
81
|
+
private checkpointManager?;
|
|
82
|
+
private pagesSinceLastCheckpoint;
|
|
72
83
|
private readonly DOCUMENTATION_PATTERNS;
|
|
73
84
|
private readonly EXCLUDED_PATTERNS;
|
|
74
85
|
constructor(httpClient?: HttpClient);
|
|
@@ -76,8 +87,21 @@ export declare class DocumentationCrawler {
|
|
|
76
87
|
* Crawl documentation starting from a root URL
|
|
77
88
|
* Uses HTTP client (axios) exclusively - no browser automation
|
|
78
89
|
* For SPA sites that require JavaScript rendering, use Cursor/Claude's built-in browser tools
|
|
90
|
+
* Supports both BFS (breadth-first) and DFS (depth-first) crawl strategies
|
|
79
91
|
*/
|
|
80
92
|
crawl(rootUrl: string, options?: CrawlOptions): Promise<CrawlResult>;
|
|
93
|
+
/**
|
|
94
|
+
* Sequential crawling (single-threaded)
|
|
95
|
+
*/
|
|
96
|
+
private crawlSequential;
|
|
97
|
+
/**
|
|
98
|
+
* Parallel crawling with multiple workers
|
|
99
|
+
*/
|
|
100
|
+
private crawlWithWorkers;
|
|
101
|
+
/**
|
|
102
|
+
* Process a single page (shared by both sequential and parallel crawling)
|
|
103
|
+
*/
|
|
104
|
+
private processPage;
|
|
81
105
|
/**
|
|
82
106
|
* Discover documentation links from a crawled page
|
|
83
107
|
*/
|
|
@@ -92,13 +116,31 @@ export declare class DocumentationCrawler {
|
|
|
92
116
|
private shouldExclude;
|
|
93
117
|
/**
|
|
94
118
|
* Check if crawled content is sufficient for skill generation
|
|
95
|
-
*
|
|
119
|
+
* Enhanced with multi-dimensional quality metrics
|
|
96
120
|
*/
|
|
97
121
|
private canGenerateSkill;
|
|
122
|
+
/**
|
|
123
|
+
* Evaluate content quality with multi-dimensional metrics
|
|
124
|
+
*/
|
|
125
|
+
private evaluateContentQuality;
|
|
126
|
+
/**
|
|
127
|
+
* Check if should continue crawling based on content quality
|
|
128
|
+
*/
|
|
129
|
+
private shouldContinueCrawling;
|
|
98
130
|
/**
|
|
99
131
|
* Fetch a page with retry logic
|
|
132
|
+
* Supports both HTML pages and Markdown files
|
|
100
133
|
*/
|
|
101
134
|
private fetchPageWithRetry;
|
|
135
|
+
/**
|
|
136
|
+
* Extract content from Markdown file
|
|
137
|
+
* Converts Markdown structure to WebDocumentationPage format
|
|
138
|
+
*/
|
|
139
|
+
private extractMarkdownContent;
|
|
140
|
+
/**
|
|
141
|
+
* Parse Markdown content into structured data
|
|
142
|
+
*/
|
|
143
|
+
private parseMarkdown;
|
|
102
144
|
/**
|
|
103
145
|
* Classify error type for better error messages
|
|
104
146
|
*/
|
|
@@ -111,6 +153,30 @@ export declare class DocumentationCrawler {
|
|
|
111
153
|
* Get error breakdown by type
|
|
112
154
|
*/
|
|
113
155
|
private getErrorBreakdown;
|
|
156
|
+
/**
|
|
157
|
+
* Try to detect and use llms.txt for optimized crawling
|
|
158
|
+
*/
|
|
159
|
+
private tryLlmsTxt;
|
|
160
|
+
/**
|
|
161
|
+
* Check if a URL is valid for crawling
|
|
162
|
+
*/
|
|
163
|
+
private isValidUrl;
|
|
164
|
+
/**
|
|
165
|
+
* Save checkpoint
|
|
166
|
+
*/
|
|
167
|
+
private saveCheckpoint;
|
|
168
|
+
/**
|
|
169
|
+
* Load checkpoint and restore state
|
|
170
|
+
*/
|
|
171
|
+
private loadCheckpoint;
|
|
172
|
+
/**
|
|
173
|
+
* Clear checkpoint after successful crawl
|
|
174
|
+
*/
|
|
175
|
+
private clearCheckpoint;
|
|
176
|
+
/**
|
|
177
|
+
* Sanitize filename for checkpoint
|
|
178
|
+
*/
|
|
179
|
+
private sanitizeFilename;
|
|
114
180
|
/**
|
|
115
181
|
* Delay helper for rate limiting
|
|
116
182
|
*/
|