@loopsaaage/n8n-nodes-smart-crawler 0.1.0
- package/LICENSE.md +19 -0
- package/README.md +165 -0
- package/dist/icons/github.dark.svg +3 -0
- package/dist/icons/github.svg +3 -0
- package/dist/icons/smart-crawler.dark.svg +9 -0
- package/dist/icons/smart-crawler.svg +9 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.d.ts +10 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.js +615 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.js.map +1 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.test.d.ts +1 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.test.js +244 -0
- package/dist/nodes/SmartCrawler/SmartCrawler.node.test.js.map +1 -0
- package/dist/package.json +59 -0
- package/dist/tsconfig.tsbuildinfo +1 -0
- package/package.json +59 -0
package/LICENSE.md
ADDED
@@ -0,0 +1,19 @@
Copyright 2022 n8n

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
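@@ -0,0 +1,165 @@
# n8n-nodes-smart-crawler

[n8n-nodes-smart-crawler on npm](https://www.npmjs.com/package/n8n-nodes-smart-crawler)

A general-purpose smart crawler node for n8n, built on Axios and Cheerio, with flexible page data extraction and multi-hop data collection.

## Features

- **🎯 Flexible selector configuration**: locate page elements precisely with CSS selectors
- **📦 Multi-field extraction**: extract several fields at once, including text, HTML, and attribute values
- **🔗 Multi-hop support**: follow up to 3 hops to extract data from nested pages
- **🍪 Cookie support**: configure a Cookie to access pages that require login
- **⚙️ Preset fields**: ready-made configuration options for commonly extracted fields
- **🔄 Automatic URL resolution**: handles both relative and absolute URLs automatically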
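
The automatic URL resolution mentioned in the feature list can be illustrated with the standard `URL` constructor. This is a minimal sketch and not the package's own `resolveUrl` code; the helper name `resolveHref` and the example URLs are assumptions made for illustration.

```typescript
// Hypothetical helper (not part of the package): resolve an extracted href
// against the URL of the page it was found on.
function resolveHref(href: string, pageUrl: string): string | null {
  try {
    // The WHATWG URL constructor accepts absolute, root-relative,
    // path-relative, and protocol-relative references.
    return new URL(href, pageUrl).toString();
  } catch {
    // Either the href or the base could not be parsed as a URL.
    return null;
  }
}

// A root-relative link replaces the whole path and drops the query string.
console.log(resolveHref("/item/42", "https://shop.example.com/products?page=2"));
// → https://shop.example.com/item/42

// An absolute link is returned unchanged.
console.log(resolveHref("https://cdn.example.com/a", "https://shop.example.com/"));
// → https://cdn.example.com/a
```

Returning `null` instead of throwing lets a caller skip a broken link and continue with the remaining list items.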

## Quick Start

### Installation

Install this node in n8n:

```bash
npm install n8n-nodes-smart-crawler
```

Then refresh the nodes in the n8n settings, and the node will appear in the node panel.

### Basic Usage

1. Add the **Smart Crawler** node to your workflow
2. Configure the following parameters:
   - **Page URL**: the URL of the page to crawl
   - **Cookie**: (optional) set this when the page requires login
   - **List Selector**: the CSS selector that matches the list of data items
3. Add field configurations:
   - **Field Name**: the field name in the output data
   - **Selector**: the CSS selector that locates the element
   - **Extraction Type**: Text Content / HTML Content / Attribute Value
   - **Attribute Name**: (when the type is Attribute Value) the name of the attribute
4. Execute the workflow to extract the data
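
The optional Cookie parameter in step 2 presumably ends up as a request header. The sketch below shows one way that could look; the helper name `buildHeaders` and the `Accept` value are assumptions, and nothing here is taken from the package's source.

```typescript
// Hypothetical helper: build request headers, adding Cookie only when one is set.
function buildHeaders(cookie?: string): Record<string, string> {
  const headers: Record<string, string> = {
    // Assumed default; the real node may send different headers.
    Accept: "text/html,application/xhtml+xml",
  };
  const trimmed = cookie?.trim();
  if (trimmed) {
    // Blank or missing cookies are left out so public pages are fetched anonymously.
    headers["Cookie"] = trimmed;
  }
  return headers;
}

console.log(buildHeaders("session=abc123"));
console.log(buildHeaders());
```

With Axios, an object like this can be passed as the `headers` option of the request config, for example `axios.get(url, { headers: buildHeaders(cookie) })`.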

## Configuration

### List Selector

Specifies the selector for the container elements that hold the data items. For example:

- `.product-item` - selects all elements with the class product-item
- `.news-list > li` - selects every list item in the list
- `div[data-type="item"]` - selects elements with a specific attribute

### Field Configuration

Each field configuration entry extracts a specific piece of data from a list item.

#### Extraction Type

- **Text Content**: the element's text content (whitespace stripped)
- **HTML Content**: the element's HTML markup
- **Attribute Value**: the value of a specified attribute of the element
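
The three extraction types can be sketched as a small dispatch function. This is an illustration under assumptions, not the package's private `extractValue` method: the type names, the `ElementLike` interface (modelled on the `text()`, `html()`, and `attr()` methods of a Cheerio selection), and the empty-string fallbacks are all hypothetical.

```typescript
// Assumed identifiers for the three extraction types described in the README.
type ExtractType = "text" | "html" | "attribute";

// Minimal element shape, modelled on the subset of a Cheerio selection used here.
interface ElementLike {
  text(): string;
  html(): string | null;
  attr(name: string): string | undefined;
}

// Hypothetical sketch: return the requested value, or "" when nothing is found.
function extractValue(el: ElementLike, type: ExtractType, attrName?: string): string {
  switch (type) {
    case "text":
      // Text content with surrounding whitespace trimmed.
      return el.text().trim();
    case "html":
      // Inner HTML; null means the selection matched nothing.
      return el.html() ?? "";
    case "attribute":
      // An attribute name is required for this type.
      return attrName ? (el.attr(attrName) ?? "") : "";
  }
}
```

Because the function depends only on the `ElementLike` interface, it can be exercised with a plain object in tests and with a real Cheerio selection at runtime.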

### Multi-hop Configuration

For fields whose data has to be extracted from another page, enable the jump configuration.

#### Jump Configuration Structure

Each hop has the following settings:

- **Click Element Selector**: the element from which the jump link is taken
- **Target Page Data Selector**: (optional) the selector for the data container on the target page
- **Fields**: the list of fields to extract on the target page

#### Hop Levels

- **First hop**: from a list item to its detail page
- **Second hop**: from the detail page to a related page
- **Third hop**: a further level of navigation
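
One way to picture the jump structure and the three-hop limit is as nested configuration objects with a depth check. The sketch below assumes that a field on a target page may itself carry a jump configuration; the interface names, property names, and both helper functions are hypothetical and are not taken from the package.

```typescript
// Hypothetical configuration shapes; property names are illustrative only.
interface JumpConfig {
  clickSelector: string;   // element whose link is followed
  targetSelector?: string; // optional data container on the target page
  fields: FieldConfig[];   // fields to extract on the target page
}

interface FieldConfig {
  name: string;
  selector: string;
  jump?: JumpConfig;       // present only for jump fields
}

const MAX_HOPS = 3;

// Returns the deepest hop level reachable from a field (0 means no jump).
function hopDepth(field: FieldConfig): number {
  if (!field.jump) return 0;
  const nested = field.jump.fields.map(hopDepth);
  return 1 + Math.max(0, ...nested);
}

// Throws when any field would need more hops than the documented limit.
function assertWithinHopLimit(fields: FieldConfig[]): void {
  for (const field of fields) {
    const depth = hopDepth(field);
    if (depth > MAX_HOPS) {
      throw new Error(`Field "${field.name}" needs ${depth} hops; the limit is ${MAX_HOPS}`);
    }
  }
}
```

Checking the depth before any request is made means a misconfigured workflow fails immediately instead of after several pages have already been fetched.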

## Usage Examples

### Example 1: Extract a News List

```
Page URL: https://news.example.com
List Selector: .news-item
Fields:
  - Field Name: title
    Selector: .title
    Extraction Type: Text Content
  - Field Name: link
    Selector: a.read-more
    Extraction Type: Attribute Value
    Attribute Name: href
```

### Example 2: Multi-hop Extraction of Product Details

```
Page URL: https://shop.example.com/products
List Selector: .product-card
Fields:
  - Field Name: productName
    Selector: h3.name
    Extraction Type: Text Content
  - Field Name: details
    Selector: a.detail-link
    Is Jump Field: Yes
    Jump Configuration (first hop):
      Click Element Selector: a
      Fields:
        - Field Name: price
          Selector: .price
          Extraction Type: Text Content
        - Field Name: description
          Selector: .description
          Extraction Type: HTML Content
```

## Development

### Local Development

```bash
# Install dependencies
npm install

# Start the development server
npm run dev

# Build
npm run build

# Lint
npm run lint
npm run lint:fix
```

### Publishing

```bash
# Build the project
npm run build

# Publish to npm
npm publish
```

## Tech Stack

- **[Axios](https://github.com/axios/axios)** - HTTP request library
- **[Cheerio](https://github.com/cheeriojs/cheerio)** - fast HTML/XML parser

## License

[MIT](LICENSE.md)

## Contributing

Issues and Pull Requests are welcome!

## Links

- [n8n](https://n8n.io) - workflow automation tool
- [n8n Docs](https://docs.n8n.io) - official documentation
- [n8n Community](https://community.n8n.io) - community forum
package/dist/icons/github.dark.svg
ADDED
@@ -0,0 +1,3 @@
<svg width="40" height="40" viewBox="0 0 40 40" fill="none" xmlns="http://www.w3.org/2000/svg">
<path fill-rule="evenodd" clip-rule="evenodd" d="M20.0165 0C8.94791 0 0 9.01388 0 20.1653C0 29.0792 5.73324 36.6246 13.6868 39.2952C14.6812 39.496 15.0454 38.8613 15.0454 38.3274C15.0454 37.8599 15.0126 36.2575 15.0126 34.5879C9.4445 35.79 8.28498 32.1841 8.28498 32.1841C7.39015 29.847 6.06429 29.2463 6.06429 29.2463C4.24185 28.011 6.19704 28.011 6.19704 28.011C8.21861 28.1446 9.27938 30.081 9.27938 30.081C11.0686 33.1522 13.9518 32.2844 15.1118 31.7502C15.2773 30.4481 15.8079 29.5467 16.3713 29.046C11.9303 28.5785 7.25781 26.8425 7.25781 19.0967C7.25781 16.8932 8.05267 15.0905 9.31216 13.6884C9.11344 13.1877 8.41732 11.1174 9.51128 8.34644C9.51128 8.34644 11.2014 7.81217 15.0122 10.4164C16.6438 9.97495 18.3263 9.7504 20.0165 9.74851C21.7067 9.74851 23.4295 9.98246 25.0205 10.4164C28.8317 7.81217 30.5218 8.34644 30.5218 8.34644C31.6158 11.1174 30.9192 13.1877 30.7205 13.6884C32.0132 15.0905 32.7753 16.8932 32.7753 19.0967C32.7753 26.8425 28.1028 28.5449 23.6287 29.046C24.358 29.6802 24.9873 30.882 24.9873 32.7851C24.9873 35.4893 24.9545 37.6596 24.9545 38.327C24.9545 38.8613 25.3192 39.496 26.3132 39.2956C34.2667 36.6242 39.9999 29.0792 39.9999 20.1653C40.0327 9.01388 31.052 0 20.0165 0Z" fill="white"/>
</svg>
package/dist/icons/github.svg
ADDED
@@ -0,0 +1,3 @@
<svg width="40" height="40" viewBox="0 0 40 40" fill="none" xmlns="http://www.w3.org/2000/svg">
<path fill-rule="evenodd" clip-rule="evenodd" d="M20.0165 0C8.94791 0 0 9.01388 0 20.1653C0 29.0792 5.73324 36.6246 13.6868 39.2952C14.6812 39.496 15.0454 38.8613 15.0454 38.3274C15.0454 37.8599 15.0126 36.2575 15.0126 34.5879C9.4445 35.79 8.28498 32.1841 8.28498 32.1841C7.39015 29.847 6.06429 29.2463 6.06429 29.2463C4.24185 28.011 6.19704 28.011 6.19704 28.011C8.21861 28.1446 9.27938 30.081 9.27938 30.081C11.0686 33.1522 13.9518 32.2844 15.1118 31.7502C15.2773 30.4481 15.8079 29.5467 16.3713 29.046C11.9303 28.5785 7.25781 26.8425 7.25781 19.0967C7.25781 16.8932 8.05267 15.0905 9.31216 13.6884C9.11344 13.1877 8.41732 11.1174 9.51128 8.34644C9.51128 8.34644 11.2014 7.81217 15.0122 10.4164C16.6438 9.97495 18.3263 9.7504 20.0165 9.74851C21.7067 9.74851 23.4295 9.98246 25.0205 10.4164C28.8317 7.81217 30.5218 8.34644 30.5218 8.34644C31.6158 11.1174 30.9192 13.1877 30.7205 13.6884C32.0132 15.0905 32.7753 16.8932 32.7753 19.0967C32.7753 26.8425 28.1028 28.5449 23.6287 29.046C24.358 29.6802 24.9873 30.882 24.9873 32.7851C24.9873 35.4893 24.9545 37.6596 24.9545 38.327C24.9545 38.8613 25.3192 39.496 26.3132 39.2956C34.2667 36.6242 39.9999 29.0792 39.9999 20.1653C40.0327 9.01388 31.052 0 20.0165 0Z" fill="#24292F"/>
</svg>
package/dist/icons/smart-crawler.dark.svg
ADDED
@@ -0,0 +1,9 @@
<svg width="40" height="40" viewBox="0 0 40 40" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect x="2" y="4" width="36" height="32" rx="4" fill="#60A5FA"/>
<path d="M6 12H34" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 18H28" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 24H22" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 30H18" stroke="white" stroke-width="2" stroke-linecap="round"/>
<circle cx="32" cy="26" r="6" fill="#34D399"/>
<path d="M29 26L31 28L35 24" stroke="white" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
</svg>
package/dist/icons/smart-crawler.svg
ADDED
@@ -0,0 +1,9 @@
<svg width="40" height="40" viewBox="0 0 40 40" fill="none" xmlns="http://www.w3.org/2000/svg">
<rect x="2" y="4" width="36" height="32" rx="4" fill="#3B82F6"/>
<path d="M6 12H34" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 18H28" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 24H22" stroke="white" stroke-width="2" stroke-linecap="round"/>
<path d="M6 30H18" stroke="white" stroke-width="2" stroke-linecap="round"/>
<circle cx="32" cy="26" r="6" fill="#10B981"/>
<path d="M29 26L31 28L35 24" stroke="white" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/>
</svg>
package/dist/nodes/SmartCrawler/SmartCrawler.node.d.ts
ADDED
@@ -0,0 +1,10 @@
import type { IExecuteFunctions, INodeExecutionData, INodeType, INodeTypeDescription } from 'n8n-workflow';
export declare class SmartCrawler implements INodeType {
    description: INodeTypeDescription;
    execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]>;
    private static extractField;
    private static extractValue;
    private static getJumpConfig;
    private static normalizeJumpConfig;
    private static resolveUrl;
}