search-talent 0.1.0__tar.gz
- search_talent-0.1.0/PKG-INFO +195 -0
- search_talent-0.1.0/README.md +169 -0
- search_talent-0.1.0/pyproject.toml +42 -0
- search_talent-0.1.0/src/search_talent/__init__.py +3 -0
- search_talent-0.1.0/src/search_talent/cli.py +69 -0
- search_talent-0.1.0/src/search_talent/reader.py +101 -0
- search_talent-0.1.0/src/search_talent/scraper.py +386 -0
@@ -0,0 +1,195 @@
+Metadata-Version: 2.4
+Name: search-talent
+Version: 0.1.0
+Summary: LinkedIn Recruiter search scraper & JSON-to-Excel tool
+Author-email: Chandler <275737875@qq.com>
+License: MIT
+Keywords: linkedin,recruiter,scraper,search,talent
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
+Classifier: Topic :: Office/Business
+Requires-Python: >=3.9
+Requires-Dist: beautifulsoup4>=4.12.0
+Requires-Dist: browser-dog>=0.1.0
+Requires-Dist: openpyxl>=3.0.0
+Requires-Dist: pandas>=1.5.0
+Requires-Dist: selenium>=4.0.0
+Requires-Dist: typer>=0.9.0
+Description-Content-Type: text/markdown
+
+# search-talent
+
+[](https://pypi.org/project/search-talent/)
+[](https://pypi.org/project/search-talent/)
+[](https://opensource.org/licenses/MIT)
+
+A Selenium-based scraper for LinkedIn Recruiter search results that automatically collects candidate information and exports it to an Excel file.
+
+## Features
+
+- **Automated scraping**: Simulates human-like scrolling to load every candidate on a LinkedIn Recruiter search results page.
+- **Structured data extraction**: Parses each candidate's name, headline, location, work experience, education, industry, connection degree, and other fields.
+- **Resumable scraping**: Pages whose JSON files are already saved are skipped automatically, so an interrupted run can be resumed.
+- **Excel export**: Consolidates the JSON data and exports it to a formatted `.xlsx` file in one step; CamelCase fields are split into separate words for readability.
+- **CLI-driven**: After installation, run the `search-talent` command directly from a terminal; no extra scripts are needed.
+
+## Installation
+
+### Install with pip (recommended)
+
+```bash
+pip install search-talent
+```
+
+### Install from source
+
+```bash
+git clone https://github.com/your-username/search-talent.git
+cd search-talent
+pip install .
+```
+
+### Dependencies
+
+The following dependencies are installed automatically:
+
+| Dependency | Purpose |
+|------------|---------|
+| `typer` | CLI framework |
+| `selenium` | Browser automation |
+| `beautifulsoup4` | HTML parsing |
+| `pandas` | Data processing |
+| `openpyxl` | Excel file reading and writing (pandas backend) |
+| `browser-dog` | LinkedIn login and cookie management |
+
+> **Note**: Running the scraper requires Chrome/Chromium and a matching version of ChromeDriver installed locally.
+
+## Prerequisites
+
+### Cookie file
+
+This tool logs in to LinkedIn Recruiter by injecting cookies, so you need to prepare a `cookie.json` file in advance.
+
+The file contains a JSON array in which each entry is a cookie object:
+
+```json
+[
+  {
+    "domain": ".linkedin.com",
+    "name": "li_at",
+    "value": "YOUR_COOKIE_VALUE",
+    "path": "/",
+    "secure": true,
+    "httpOnly": true
+  }
+]
+```
+
+> **Tip**: A browser extension such as EditThisCookie can export the cookies of a logged-in LinkedIn session.
+
+## CLI usage guide
+
+After installation, the `search-talent` command is available directly in the terminal.
+
+### Show help
+
+```bash
+search-talent --help
+```
+
+### `scrape` command - scrape LinkedIn data
+
+Scrapes candidate information from LinkedIn Recruiter search result pages and saves it as JSON files.
+
+```bash
+search-talent scrape \
+  --start-url "https://www.linkedin.com/talent/search?q=..." \
+  --search-keyword "python-developer" \
+  --folder ./output
+```
+
+**Options:**
+
+| Option | Short | Description | Default |
+|--------|-------|-------------|---------|
+| `--start-url` | `-s` | LinkedIn Recruiter search results URL (required) | - |
+| `--search-keyword` | `-k` | Search keyword, used to name the JSON files (required) | - |
+| `--folder` | `-f` | Local directory in which to save the JSON files (required) | - |
+| `--cookie` | `-c` | Path to the cookie JSON file | `cookie.json` |
+| `--scroll-times` | `-n` | Number of times to scroll the page | `3` |
+| `--max-pages` | `-m` | Maximum number of pages to scrape (unlimited if omitted) | `None` |
+| `--headless` | - | Run the browser in headless mode | `False` |
+
+**Examples:**
+
+```bash
+# Scrape the first 5 pages using a custom cookie file
+search-talent scrape \
+  -s "https://www.linkedin.com/talent/search?q=engineer&location=China" \
+  -k "engineer-china" \
+  -f ./data/engineer-china \
+  -c ~/linkedin_cookie.json \
+  -m 5
+
+# Run in headless mode
+search-talent scrape \
+  -s "https://www.linkedin.com/talent/search?q=..." \
+  -k "my-search" \
+  -f ./data \
+  --headless
+```
+
+### `export` command - export to Excel
+
+Merges the JSON files produced by the `scrape` command and exports them to a single Excel file.
+
+```bash
+search-talent export \
+  --input ./output \
+  --output candidates.xlsx
+```
+
+**Options:**
+
+| Option | Short | Description | Default |
+|--------|-------|-------------|---------|
+| `--input` | `-i` | Input directory containing the JSON files (required) | - |
+| `--output` | `-o` | Output Excel file path (required) | - |
+
+**Examples:**
+
+```bash
+search-talent export -i ./data/engineer-china -o engineer-china.xlsx
+```
+
+## Project structure
+
+```
+search-talent/
+├── pyproject.toml          # Project configuration and build definition
+├── README.md
+└── src/
+    └── search_talent/
+        ├── __init__.py     # Package entry point, version information
+        ├── scraper.py      # Core LinkedIn scraper module
+        ├── reader.py       # JSON reading and Excel export module
+        └── cli.py          # Typer CLI entry point
+```
+
+## Notes
+
+- **Compliant use**: Follow LinkedIn's terms of service and keep scraping frequency and volume reasonable to avoid having your account banned.
+- **Cookie expiry**: LinkedIn cookies expire; if login fails, export the cookie file again.
+- **Anti-scraping measures**: The tool has built-in anti-detection strategies such as random waits and human-like scrolling, but keeping the scraping speed moderate is still recommended.
+- **ChromeDriver version**: Make sure the local ChromeDriver version matches your Chrome browser version.
+
+## License
+
+[MIT License](https://opensource.org/licenses/MIT)
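The cookie file format described in the README can be checked before a run. The sketch below is illustrative only: `load_cookies` and `REQUIRED_KEYS` are hypothetical names that do not appear in the package, and treating `name`, `value`, and `domain` as mandatory is an assumption drawn from the README example rather than from `browser-dog`.

```python
import json

# Assumption: these keys are treated as mandatory because the README example
# includes them; the package does not document a required set.
REQUIRED_KEYS = {"name", "value", "domain"}


def load_cookies(path):
    """Load a cookie file and check that it is a JSON array of cookie objects."""
    with open(path, "r", encoding="utf-8") as f:
        cookies = json.load(f)
    if not isinstance(cookies, list):
        raise ValueError("cookie file must contain a JSON array")
    for index, cookie in enumerate(cookies):
        if not isinstance(cookie, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        # set.difference accepts any iterable, so the dict's keys work directly.
        missing = REQUIRED_KEYS.difference(cookie)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
    return cookies


if __name__ == "__main__":
    # Hypothetical usage: validate the default file name used by the CLI.
    print(len(load_cookies("cookie.json")), "cookies loaded")
```

Failing early on a malformed file gives a clearer error than a login failure later in the browser session.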
@@ -0,0 +1,42 @@
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/search_talent"]
+
+[project]
+name = "search-talent"
+version = "0.1.0"
+description = "LinkedIn Recruiter search scraper & JSON-to-Excel tool"
+readme = "README.md"
+requires-python = ">=3.9"
+license = { text = "MIT" }
+authors = [
+    { name = "Chandler", email = "275737875@qq.com" },
+]
+keywords = ["linkedin", "recruiter", "scraper", "talent", "search"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Internet :: WWW/HTTP :: Browsers",
+    "Topic :: Office/Business",
+]
+dependencies = [
+    "typer>=0.9.0",
+    "selenium>=4.0.0",
+    "beautifulsoup4>=4.12.0",
+    "pandas>=1.5.0",
+    "openpyxl>=3.0.0",
+    "browser-dog>=0.1.0",
+]
+
+[project.scripts]
+search-talent = "search_talent.cli:app"
+
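The `[project.scripts]` entry maps the `search-talent` command to an object reference of the form `module:attribute`. The sketch below shows how such a reference can be resolved. `resolve_entry_point` is a hypothetical helper written for illustration (installers generate their own launcher script), and it is demonstrated on the standard-library reference `json:dumps` so that it runs without the package installed.

```python
import importlib


def resolve_entry_point(reference):
    """Resolve a 'module:attribute' object reference to the object it names."""
    module_name, _, attribute_path = reference.partition(":")
    obj = importlib.import_module(module_name)
    if attribute_path:
        # Attribute paths may be dotted, e.g. "pkg.mod:Class.method".
        for part in attribute_path.split("."):
            obj = getattr(obj, part)
    return obj


dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))  # prints {"ok": true}
```

With the package installed, the same helper applied to `"search_talent.cli:app"` would return the Typer application object that the console script calls.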
@@ -0,0 +1,69 @@
+"""Talent Search CLI - a Typer-based command-line tool"""
+
+from typing import Optional
+
+import typer
+
+from .scraper import (
+    LinkedInProfileParser,
+    LinkedInPageParser,
+    SeleniumPageNavigator,
+    JsonDataSaver,
+    LinkedInScraper,
+    login,
+)
+from .reader import read_json_files_and_save_to_excel
+
+app = typer.Typer(
+    name="talent-search",
+    help="LinkedIn Recruiter search scraper & JSON-to-Excel tool",
+    add_completion=False,
+)
+
+
+@app.command()
+def scrape(
+    start_url: str = typer.Option(..., "--start-url", "-s", help="LinkedIn search URL"),
+    search_keyword: str = typer.Option(..., "--search-keyword", "-k", help="Search keyword for file naming"),
+    folder: str = typer.Option(..., "--folder", "-f", help="Local folder to save scraped data"),
+    cookie_file: str = typer.Option("cookie.json", "--cookie", "-c", help="Cookie JSON file path"),
+    scroll_times: int = typer.Option(3, "--scroll-times", "-n", help="Number of times to scroll the page"),
+    max_pages: Optional[int] = typer.Option(None, "--max-pages", "-m", help="Maximum pages to scrape (None=unlimited)"),
+    headless: bool = typer.Option(False, "--headless", help="Run browser in headless mode"),
+):
+    """Scrape LinkedIn Recruiter search results and save as JSON files."""
+    profile_parser = LinkedInProfileParser()
+    page_parser = LinkedInPageParser(profile_parser)
+    navigator = SeleniumPageNavigator()
+    data_saver = JsonDataSaver()
+
+    driver = login(cookie_file=cookie_file, headless=headless)
+
+    scraper = LinkedInScraper(driver, page_parser, navigator, data_saver)
+    scraper.scrape(
+        start_url=start_url,
+        search_keyword=search_keyword,
+        folder=folder,
+        scroll_times=scroll_times,
+        max_pages=max_pages,
+    )
+
+    typer.echo(f"Scraping completed. Data saved to: {folder}")
+
+
+@app.command()
+def export(
+    input: str = typer.Option(..., "--input", "-i", help="Input folder containing JSON files"),
+    output: str = typer.Option(..., "--output", "-o", help="Output Excel file path (e.g. output.xlsx)"),
+):
+    """Read JSON files from folder and export to Excel."""
+    try:
+        read_json_files_and_save_to_excel(input, output)
+        typer.echo(f"Export completed: {output}")
+    except Exception as e:
+        typer.echo(f"Error: {e}", err=True)
+        raise typer.Exit(code=1)
+
+
+if __name__ == "__main__":
+    app()
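The `--folder`, `--search-keyword`, and `--max-pages` options together determine which per-page JSON files a run produces, and this is what makes the resumable scraping described in the README possible. A minimal sketch of such a check follows. The file-name pattern matches `JsonDataSaver.save` in `scraper.py`; `page_file_path` and `pages_to_fetch` are hypothetical helpers, 1-based page numbering is an assumption, and the package's actual skip logic may differ.

```python
import os
import tempfile


def page_file_path(folder, search_keyword, page_num):
    """Build the JSON path for one results page, following the JsonDataSaver naming pattern."""
    prefix = search_keyword if search_keyword else "profiles"
    return os.path.join(folder, f"{prefix}_page_{page_num}.json")


def pages_to_fetch(folder, search_keyword, max_pages):
    """Return the page numbers (assumed 1-based) whose JSON file does not exist yet."""
    return [
        page
        for page in range(1, max_pages + 1)
        if not os.path.exists(page_file_path(folder, search_keyword, page))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Pretend pages 1 and 2 were saved by an earlier, interrupted run.
        for done in (1, 2):
            with open(page_file_path(folder, "engineer-china", done), "w", encoding="utf-8") as f:
                f.write("[]")
        print(pages_to_fetch(folder, "engineer-china", 5))
```

Because the keyword is part of the file name, reusing a folder with a different `--search-keyword` would not trigger any skips under this scheme.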
@@ -0,0 +1,101 @@
+"""JSON file reading and Excel export module"""
+
+import json
+import os
+import re
+from typing import Optional
+
+import pandas as pd
+
+
+def split_camel_case(word: str) -> str:
+    """Split a CamelCase string into space-separated words.
+
+    Example: 'DigitalSalesStrategy' -> 'Digital Sales Strategy'
+    """
+    return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", word)
+
+
+def split_compound_words(text: str) -> str:
+    """Find CamelCase compound words in a string and split them into space-separated words, leaving the rest unchanged."""
+    pattern = r"[A-Z][a-z]+(?:[A-Z][a-z]+)+"
+
+    def replacer(match):
+        return split_camel_case(match.group(0))
+
+    return re.sub(pattern, replacer, text)
+
+
+def col_format(exps: list) -> str:
+    """Format an experiences / education list as a readable string"""
+    exp_list = []
+    for exp in exps:
+        exp_format = exp["title_company"].replace("\u00b7", " (") + ")"
+        exp_format_str = split_compound_words(exp_format)
+        exp_list.append(exp_format_str)
+    return "\n".join(exp_list)
+
+
+def read_json_files_and_save_to_excel(folder_path: str, output_excel_path: str) -> None:
+    """Read every JSON file in the given folder (each one a list) and save the data as an Excel file.
+
+    Args:
+        folder_path: path of the folder containing the JSON files
+        output_excel_path: path of the output Excel file
+    """
+    if not os.path.exists(folder_path):
+        raise FileNotFoundError(f"Folder does not exist: {folder_path}")
+
+    all_data = []
+
+    for filename in os.listdir(folder_path):
+        file_path = os.path.join(folder_path, filename)
+
+        if not (os.path.isfile(file_path) and filename.lower().endswith(".json")):
+            continue
+
+        try:
+            with open(file_path, "r", encoding="utf-8") as f:
+                data = json.load(f)
+
+            if not isinstance(data, list):
+                print(f"Skipped: {filename}, content is not a list.")
+                continue
+
+            for item in data:
+                if isinstance(item, dict):
+                    item["source_file"] = filename
+                else:
+                    item = {"value": item, "source_file": filename}
+
+                exp = col_format(item["experiences"])
+                edu = col_format(item["education"])
+                item["experiences"] = exp
+                item["education"] = edu
+
+                title = split_compound_words(item["headline"])
+                item["headline"] = title
+                all_data.append(item)
+
+            print(f"Read: {filename} -> {len(data)} items")
+
+        except Exception as e:
+            print(f"Failed to read: {filename}, error: {e}")
+
+    if all_data:
+        df = pd.DataFrame(all_data)
+        columns_order = [
+            "name",
+            "headline",
+            "location",
+            "experiences",
+            "education",
+            "profile_url",
+            "industry",
+            "source_file",
+        ]
+        df = df[columns_order]
+        df.to_excel(output_excel_path, index=False)
+        print(f"\nAll data saved successfully to: {output_excel_path}")
+    else:
+        print("No valid data to save.")
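The string handling in `reader.py` can be exercised in isolation, without pandas. The sketch below copies the two regular expressions from `split_camel_case` and `split_compound_words` and adds `format_history`, a hypothetical tolerant variant of `col_format`. The fallback from `"title_company"` to `"title"` is an assumption based on the visible part of `scraper.py`, where `_parse_history_item` returns the keys `"title"` and `"date_range"`; code outside this excerpt may reshape the data.

```python
import re


def split_camel_case(word):
    """Insert spaces at CamelCase boundaries (same regex as reader.py)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", word)


def split_compound_words(text):
    """Split only the CamelCase runs inside a longer string (same pattern as reader.py)."""
    pattern = r"[A-Z][a-z]+(?:[A-Z][a-z]+)+"
    return re.sub(pattern, lambda m: split_camel_case(m.group(0)), text)


def format_history(entries):
    """Hypothetical tolerant variant of col_format (not part of the package).

    Assumption: an entry may carry its text under "title_company" (the key
    col_format reads) or under "title" (the key the visible parser emits).
    """
    lines = []
    for entry in entries:
        text = entry.get("title_company") or entry.get("title") or ""
        if "\u00b7" in text:
            # Turn "Role·Company" into "Role (Company)"; the closing bracket
            # is added only when the separator is actually present.
            text = text.replace("\u00b7", " (", 1) + ")"
        lines.append(split_compound_words(text))
    return "\n".join(lines)


print(split_camel_case("DigitalSalesStrategy"))  # prints Digital Sales Strategy
print(split_compound_words("Head of DigitalSalesStrategy"))  # prints Head of Digital Sales Strategy
print(format_history([{"title": "SalesManager\u00b7Acme"}]))
```

One limitation of the pattern is worth knowing: a name such as `McDonald` also matches and is split into `Mc Donald`.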
@@ -0,0 +1,386 @@
|
|
|
1
|
+
"""LinkedIn Recruiter 搜索爬虫模块"""
|
|
2
|
+
|
|
3
|
+
from abc import ABC, abstractmethod
|
|
4
|
+
import json
|
|
5
|
+
import os
|
|
6
|
+
import random
|
|
7
|
+
import re
|
|
8
|
+
import time
|
|
9
|
+
from typing import Dict, List, Optional
|
|
10
|
+
|
|
11
|
+
from bs4 import BeautifulSoup
|
|
12
|
+
from selenium.webdriver.remote.webdriver import WebDriver
|
|
13
|
+
from selenium.webdriver.support.ui import WebDriverWait
|
|
14
|
+
from selenium.webdriver.support import expected_conditions as EC
|
|
15
|
+
from selenium.common.exceptions import TimeoutException
|
|
16
|
+
from selenium.webdriver.common.by import By
|
|
17
|
+
|
|
18
|
+
from browser_dog.browser import BrowserDog
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
# ===================== 登录 =====================
|
|
22
|
+
|
|
23
|
+
def login(cookie_file: str = "cookie.json", headless: bool = False) -> WebDriver:
|
|
24
|
+
"""使用 BrowserDog 登录 LinkedIn Talent"""
|
|
25
|
+
dog = BrowserDog(
|
|
26
|
+
cookies_json=cookie_file,
|
|
27
|
+
base_url="https://www.linkedin.com/talent/home",
|
|
28
|
+
headless=headless,
|
|
29
|
+
)
|
|
30
|
+
|
|
31
|
+
if dog.detect_block_or_captcha():
|
|
32
|
+
raise RuntimeError("Login failed or blocked!")
|
|
33
|
+
|
|
34
|
+
driver = dog.get_driver()
|
|
35
|
+
driver.maximize_window()
|
|
36
|
+
print("正在打开 LinkedIn Talent 首页...")
|
|
37
|
+
driver.get("https://www.linkedin.com/talent/home")
|
|
38
|
+
dog.long_wait()
|
|
39
|
+
return driver
|
|
40
|
+
|
|
41
|
+
|
|
42
|
+
# ===================== 1. 解析器接口 =====================
|
|
43
|
+
|
|
44
|
+
class ProfileParser(ABC):
|
|
45
|
+
"""个人资料解析器抽象接口"""
|
|
46
|
+
|
|
47
|
+
@abstractmethod
|
|
48
|
+
def parse(self, soup: BeautifulSoup) -> Dict:
|
|
49
|
+
"""解析单个个人资料卡片"""
|
|
50
|
+
pass
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
class LinkedInProfileParser(ProfileParser):
|
|
54
|
+
"""LinkedIn 个人资料卡片解析器"""
|
|
55
|
+
|
|
56
|
+
def parse(self, soup: BeautifulSoup) -> Dict:
|
|
57
|
+
data = {
|
|
58
|
+
"name": self._extract_name(soup),
|
|
59
|
+
"profile_url": self._extract_profile_url(soup),
|
|
60
|
+
"avatar_url": self._extract_avatar_url(soup),
|
|
61
|
+
"headline": self._extract_headline(soup),
|
|
62
|
+
"location": self._extract_location(soup),
|
|
63
|
+
"industry": self._extract_industry(soup),
|
|
64
|
+
"connection_degree": self._extract_connection_degree(soup),
|
|
65
|
+
"experiences": self._extract_experiences(soup),
|
|
66
|
+
"education": self._extract_education(soup),
|
|
67
|
+
"decorations": self._extract_decorations(soup),
|
|
68
|
+
}
|
|
69
|
+
return data
|
|
70
|
+
|
|
71
|
+
def _extract_name(self, soup: BeautifulSoup) -> Optional[str]:
|
|
72
|
+
elem = soup.select_one('[data-test-row-lockup-full-name] a')
|
|
73
|
+
return elem.get_text(strip=True) if elem else None
|
|
74
|
+
|
|
75
|
+
def _extract_profile_url(self, soup: BeautifulSoup) -> Optional[str]:
|
|
76
|
+
elem = soup.select_one('[data-test-row-lockup-full-name] a')
|
|
77
|
+
return elem.get("href") if elem else None
|
|
78
|
+
|
|
79
|
+
def _extract_avatar_url(self, soup: BeautifulSoup) -> Optional[str]:
|
|
80
|
+
elem = soup.select_one("[data-test-lockup-image]")
|
|
81
|
+
return elem.get("src") if elem else None
|
|
82
|
+
|
|
83
|
+
def _extract_headline(self, soup: BeautifulSoup) -> Optional[str]:
|
|
84
|
+
elem = soup.select_one("[data-test-row-lockup-headline]")
|
|
85
|
+
return elem.get_text(strip=True) if elem else None
|
|
86
|
+
|
|
87
|
+
def _extract_location(self, soup: BeautifulSoup) -> Optional[str]:
|
|
88
|
+
elem = soup.select_one("[data-test-row-lockup-location]")
|
|
89
|
+
return elem.get_text(strip=True) if elem else None
|
|
90
|
+
|
|
91
|
+
def _extract_industry(self, soup: BeautifulSoup) -> Optional[str]:
|
|
92
|
+
elem = soup.select_one("[data-test-current-employer-industry]")
|
|
93
|
+
return elem.get_text(strip=True) if elem else None
|
|
94
|
+
|
|
95
|
+
def _extract_connection_degree(self, soup: BeautifulSoup) -> Optional[str]:
|
|
96
|
+
elem = soup.select_one('.artdeco-entity-lockup__badge span[aria-hidden="true"]')
|
|
97
|
+
if elem:
|
|
98
|
+
text = elem.get_text(strip=True)
|
|
99
|
+
return re.sub(r"^·\s*", "", text).strip()
|
|
100
|
+
badge = soup.select_one(".artdeco-entity-lockup__badge")
|
|
101
|
+
if badge:
|
|
102
|
+
text = badge.get_text(strip=True)
|
|
103
|
+
match = re.search(r"(\d+(?:st|nd|rd|th))", text)
|
|
104
|
+
return match.group(1) if match else text
|
|
105
|
+
return None
|
|
106
|
+
|
|
107
|
+
def _extract_experiences(self, soup: BeautifulSoup) -> List[Dict]:
|
|
108
|
+
return self._extract_history_group(soup, "Experience")
|
|
109
|
+
|
|
110
|
+
def _extract_education(self, soup: BeautifulSoup) -> List[Dict]:
|
|
111
|
+
return self._extract_history_group(soup, "Education")
|
|
112
|
+
|
|
113
|
+
def _extract_history_group(self, soup: BeautifulSoup, group_title: str) -> List[Dict]:
|
|
114
|
+
groups = soup.select(".history-group")
|
|
115
|
+
for group in groups:
|
|
116
|
+
title_elem = group.select_one(".history-group__term")
|
|
117
|
+
if title_elem and group_title in title_elem.get_text():
|
|
118
|
+
return [self._parse_history_item(li) for li in group.select(".history-group__list-items li")]
|
|
119
|
+
return []
|
|
120
|
+
|
|
121
|
+
def _parse_history_item(self, li: BeautifulSoup) -> Dict:
|
|
122
|
+
main_text = li.get_text(strip=True)
|
|
123
|
+
date_elem = li.select_one(".description-entry__date-duration")
|
|
124
|
+
date_range = None
|
|
125
|
+
if date_elem:
|
|
126
|
+
times = date_elem.find_all("time")
|
|
127
|
+
if len(times) == 2:
|
|
128
|
+
date_range = f"{times[0].get_text(strip=True)} – {times[1].get_text(strip=True)}"
|
|
129
|
+
elif len(times) == 1:
|
|
130
|
+
date_range = times[0].get_text(strip=True)
|
|
131
|
+
return {"title": main_text, "date_range": date_range}
|
|
132
|
+
|
|
133
|
+
def _extract_decorations(self, soup: BeautifulSoup) -> Dict:
|
|
134
|
+
decorations: Dict[str, List[str]] = {"interest": [], "activity": []}
|
|
135
|
+
for row in soup.select(".decorations__row"):
|
|
136
|
+
label = row.select_one(".decorations__row-title")
|
|
137
|
+
if not label:
|
|
138
|
+
continue
|
|
139
|
+
label_text = label.get_text(strip=True).lower()
|
|
140
|
+
triggers = row.select(".base-decoration__trigger-text")
|
|
141
|
+
if label_text == "interest":
|
|
142
|
+
decorations["interest"] = [t.get_text(strip=True) for t in triggers]
|
|
143
|
+
elif label_text == "activity":
|
|
144
|
+
decorations["activity"] = [t.get_text(strip=True) for t in triggers]
|
|
145
|
+
return decorations
|
|
146
|
+
|
|
147
|
+
|
|
148
|
+
# ===================== 2. 页面解析器 =====================
|
|
149
|
+
|
|
150
|
+
class PageParser(ABC):
|
|
151
|
+
"""页面解析器抽象接口"""
|
|
152
|
+
|
|
153
|
+
@abstractmethod
|
|
154
|
+
def parse_profiles(self, page_source: str) -> List[Dict]:
|
|
155
|
+
"""从页面源码中解析所有个人资料"""
|
|
156
|
+
pass
|
|
157
|
+
|
|
158
|
+
|
|
159
|
+
class LinkedInPageParser(PageParser):
|
|
160
|
+
"""LinkedIn 搜索结果页面解析器"""
|
|
161
|
+
|
|
162
|
+
def __init__(self, profile_parser: ProfileParser):
|
|
163
|
+
self.profile_parser = profile_parser
|
|
164
|
+
|
|
165
|
+
def parse_profiles(self, page_source: str) -> List[Dict]:
|
|
166
|
+
soup = BeautifulSoup(page_source, "html.parser")
|
|
167
|
+
profile_items = soup.select("li.profile-list__border-bottom")
|
|
168
|
+
return [self.profile_parser.parse(item) for item in profile_items]
|
|
169
|
+
|
|
170
|
+
|
|
171
|
+
# ===================== 3. 页面导航接口 =====================
|
|
172
|
+
|
|
173
|
+
class PageNavigator(ABC):
|
|
174
|
+
"""页面导航抽象接口"""
|
|
175
|
+
|
|
176
|
+
@abstractmethod
|
|
177
|
+
def scroll_and_wait(self, driver: WebDriver, **kwargs) -> None:
|
|
178
|
+
"""滚动页面并等待"""
|
|
179
|
+
pass
|
|
180
|
+
|
|
181
|
+
@abstractmethod
|
|
182
|
+
def click_next(self, driver: WebDriver, timeout: int = 10) -> bool:
|
|
183
|
+
"""点击下一页按钮,返回是否成功"""
|
|
184
|
+
pass
|
|
185
|
+
|
|
186
|
+
|
|
187
|
+
class SeleniumPageNavigator(PageNavigator):
|
|
188
|
+
"""基于 Selenium 的页面导航实现"""
|
|
189
|
+
|
|
190
|
+
def scroll_and_wait(
|
|
191
|
+
self,
|
|
192
|
+
driver: WebDriver,
|
|
193
|
+
max_scroll_rounds: int = 30,
|
|
194
|
+
height_stable_count: int = 3,
|
|
195
|
+
min_wait: float = 1.0,
|
|
196
|
+
max_wait: float = 3.0,
|
|
197
|
+
scroll_step_ratio: tuple = (0.3, 0.8),
|
|
198
|
+
) -> None:
|
|
199
|
+
"""动态滚动加载页面所有内容"""
|
|
200
|
+
print("开始智能滚动加载(动态检测内容)...")
|
|
201
|
+
|
|
202
|
+
viewport_height = driver.execute_script("return window.innerHeight")
|
|
203
|
+
|
|
204
|
+
def get_total_height() -> int:
|
|
205
|
+
return driver.execute_script("return document.body.scrollHeight")
|
|
206
|
+
|
|
207
|
+
def get_scroll_y() -> int:
|
|
208
|
+
return driver.execute_script("return window.scrollY")
|
|
209
|
+
|
|
210
|
+
def scroll_to(y: int) -> None:
|
|
211
|
+
driver.execute_script(f"window.scrollTo(0, {y});")
|
|
212
|
+
|
|
213
|
+
def scroll_by(delta: int) -> None:
|
|
214
|
+
driver.execute_script(f"window.scrollBy(0, {delta});")
|
|
215
|
+
|
|
216
|
+
def random_wait() -> None:
|
|
217
|
+
time.sleep(random.uniform(min_wait, max_wait))
|
|
218
|
+
|
|
219
|
+
last_height = get_total_height()
|
|
220
|
+
stable_rounds = 0
|
|
221
|
+
current_round = 0
|
|
222
|
+
current_pos = get_scroll_y()
|
|
223
|
+
|
|
224
|
+
while current_round < max_scroll_rounds and stable_rounds < height_stable_count:
|
|
225
|
+
step = random.uniform(*scroll_step_ratio) * viewport_height
|
|
226
|
+
target_y = min(current_pos + step, last_height)
|
|
227
|
+
|
|
228
|
+
if last_height - target_y < viewport_height * 0.2:
|
|
229
|
+
target_y = last_height
|
|
230
|
+
|
|
231
|
+
scroll_to(target_y)
|
|
232
|
+
random_wait()
|
|
233
|
+
|
|
234
|
+
current_pos = get_scroll_y()
|
|
235
|
+
new_height = get_total_height()
|
|
236
|
+
|
|
237
|
+
if new_height > last_height:
|
|
238
|
+
print(f"轮次 {current_round + 1}: 高度从 {last_height} 增加到 {new_height}")
|
|
239
|
+
last_height = new_height
|
|
240
|
+
stable_rounds = 0
|
|
241
|
+
if random.random() < 0.3:
|
|
242
|
+
back_distance = random.uniform(0.1, 0.3) * viewport_height
|
|
243
|
+
scroll_by(-back_distance)
|
|
244
|
+
random_wait()
|
|
245
|
+
scroll_by(back_distance)
|
|
246
|
+
random_wait()
|
|
247
|
+
else:
|
|
248
|
+
stable_rounds += 1
|
|
249
|
+
print(f"轮次 {current_round + 1}: 高度未变化 ({new_height}),稳定计数 {stable_rounds}")
|
|
250
|
+
|
|
251
|
+
if (last_height - current_pos) < viewport_height * 0.1 and stable_rounds < height_stable_count:
|
|
252
|
+
print("接近底部但高度未变,强制滚动到底部并等待加载...")
|
|
253
|
+
scroll_to(last_height)
|
|
254
|
+
random_wait()
|
|
255
|
+
new_height = get_total_height()
|
|
256
|
+
if new_height > last_height:
|
|
257
|
+
last_height = new_height
|
|
258
|
+
stable_rounds = 0
|
|
259
|
+
current_pos = get_scroll_y()
|
|
260
|
+
|
|
261
|
+
current_round += 1
|
|
262
|
+
|
|
263
|
+
scroll_to(get_total_height())
|
|
264
|
+
time.sleep(random.uniform(1.5, 3.0))
|
|
265
|
+
|
|
266
|
+
if random.random() < 0.5:
|
|
267
|
+
print("最终回滚到中部并再次滚动到底部...")
|
|
268
|
+
scroll_to(get_total_height() // 2)
|
|
269
|
+
time.sleep(random.uniform(0.5, 1.5))
|
|
270
|
+
scroll_to(get_total_height())
|
|
271
|
+
time.sleep(random.uniform(1.0, 2.0))
|
|
272
|
+
|
|
273
|
+
print("智能滚动加载完成,页面所有内容已尝试加载。")
|
|
274
|
+
|
|
275
|
+
def click_next(self, driver: WebDriver, timeout: int = 10) -> bool:
|
|
276
|
+
try:
|
|
277
|
+
next_btn = WebDriverWait(driver, timeout).until(
|
|
278
|
+
EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-test-pagination-next]"))
|
|
279
|
+
)
|
|
280
|
+
if next_btn.get_attribute("aria-disabled") == "true" or next_btn.get_attribute("disabled") is not None:
|
|
281
|
+
print("Next 按钮已禁用,没有更多页面。")
|
|
282
|
+
return False
|
|
283
|
+
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", next_btn)
|
|
284
|
+
time.sleep(0.5)
|
|
285
|
+
next_btn.click()
|
|
286
|
+
time.sleep(5)
|
|
287
|
+
print("Clicked the Next button")
|
|
288
|
+
time.sleep(3)
|
|
289
|
+
return True
|
|
290
|
+
except TimeoutException:
|
|
291
|
+
print("Next button not found or not clickable; reached the last page.")
|
|
292
|
+
return False
|
|
293
|
+
except Exception as e:
|
|
294
|
+
print(f"Error while clicking the Next button: {e}")
|
|
295
|
+
return False
|
|
296
|
+
|
|
297
|
+
|
|
298
|
+
# ===================== 4. Data persistence interface =====================
|
|
299
|
+
|
|
300
|
+
class DataSaver(ABC):
|
|
301
|
+
"""Abstract interface for saving data"""
|
|
302
|
+
|
|
303
|
+
@abstractmethod
|
|
304
|
+
def save(self, data: List[Dict], page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> None:
|
|
305
|
+
pass
|
|
306
|
+
|
|
307
|
+
|
|
308
|
+
class JsonDataSaver(DataSaver):
|
|
309
|
+
"""Implementation that saves data as JSON"""
|
|
310
|
+
|
|
311
|
+
def save(self, data: List[Dict], page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> None:
|
|
312
|
+
filename = f"{search_keyword}_page_{page_num}.json" if search_keyword else f"profiles_page_{page_num}.json"
|
|
313
|
+
if folder:
|
|
314
|
+
os.makedirs(folder, exist_ok=True)
|
|
315
|
+
filepath = os.path.join(folder, filename)
|
|
316
|
+
else:
|
|
317
|
+
filepath = filename
|
|
318
|
+
with open(filepath, "w", encoding="utf-8") as f:
|
|
319
|
+
json.dump(data, f, ensure_ascii=False, indent=2)
|
|
320
|
+
print(f"Saved {len(data)} profiles to {filepath}")
|
|
321
|
+
|
|
322
|
+
|
|
323
|
+
# ===================== 5. Scraper core =====================
|
|
324
|
+
|
|
325
|
+
class LinkedInScraper:
|
|
326
|
+
"""LinkedIn search-results scraper; coordinates the components to carry out the scrape"""
|
|
327
|
+
|
|
328
|
+
def __init__(
|
|
329
|
+
self,
|
|
330
|
+
driver: WebDriver,
|
|
331
|
+
page_parser: PageParser,
|
|
332
|
+
navigator: PageNavigator,
|
|
333
|
+
data_saver: DataSaver,
|
|
334
|
+
):
|
|
335
|
+
self.driver = driver
|
|
336
|
+
self.page_parser = page_parser
|
|
337
|
+
self.navigator = navigator
|
|
338
|
+
self.data_saver = data_saver
|
|
339
|
+
|
|
340
|
+
def _check_json_list_length(self, page_num: int, search_keyword: str, folder: Optional[str]) -> int:
|
|
341
|
+
filename = f"{search_keyword}_page_{page_num}.json" if search_keyword else f"profiles_page_{page_num}.json"
|
|
342
|
+
if folder:
|
|
343
|
+
os.makedirs(folder, exist_ok=True)
|
|
344
|
+
filepath = os.path.join(folder, filename)
|
|
345
|
+
else:
|
|
346
|
+
filepath = filename
|
|
347
|
+
|
|
348
|
+
if os.path.exists(filepath):
|
|
349
|
+
print(f"File {filepath} exists, skipping...")
|
|
350
|
+
return 1
|
|
351
|
+
else:
|
|
352
|
+
print(f"File {filepath} does not exist; returning length 0.")
|
|
353
|
+
return 0
|
|
354
|
+
|
|
355
|
+
def scrape(
|
|
356
|
+
self,
|
|
357
|
+
start_url: str,
|
|
358
|
+
search_keyword: str = "",
|
|
359
|
+
folder: Optional[str] = None,
|
|
360
|
+
scroll_times: int = 100,
|
|
361
|
+
max_pages: Optional[int] = None,
|
|
362
|
+
) -> None:
|
|
363
|
+
"""Run the scrape"""
|
|
364
|
+
page_num = 1
|
|
365
|
+
self.driver.get(start_url)
|
|
366
|
+
time.sleep(10)
|
|
367
|
+
json_list_length = self._check_json_list_length(page_num, search_keyword, folder)
|
|
368
|
+
|
|
369
|
+
if json_list_length != 1:
|
|
370
|
+
self.navigator.scroll_and_wait(self.driver)
|
|
371
|
+
|
|
372
|
+
self._process_current_page(page_num, search_keyword, folder)
|
|
373
|
+
|
|
374
|
+
while (max_pages is None or page_num < max_pages) and self.navigator.click_next(self.driver):
|
|
375
|
+
page_num += 1
|
|
376
|
+
self._process_current_page(page_num, search_keyword, folder)
|
|
377
|
+
|
|
378
|
+
def _process_current_page(self, page_num: int, search_keyword: str, folder: Optional[str]) -> None:
|
|
379
|
+
json_list_length = self._check_json_list_length(page_num, search_keyword, folder)
|
|
380
|
+
if json_list_length == 1:
|
|
381
|
+
return
|
|
382
|
+
|
|
383
|
+
self.navigator.scroll_and_wait(self.driver)
|
|
384
|
+
page_source = self.driver.page_source
|
|
385
|
+
profiles = self.page_parser.parse_profiles(page_source)
|
|
386
|
+
self.data_saver.save(profiles, page_num, search_keyword, folder)
|
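In the file above, `JsonDataSaver.save` and `LinkedInScraper._check_json_list_length` each build the same per-page filename on their own, and the existence check also calls `os.makedirs` as a side effect. The following is a minimal sketch of how that path logic might be shared so the resume check stays read-only. The helper names `page_filepath`, `page_already_saved`, and `save_page` are hypothetical and are not part of the package; only the filename scheme is taken from the diff.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def page_filepath(page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> str:
    """Build the per-page JSON path using the naming scheme shown in the diff."""
    # Hypothetical helper: falls back to the "profiles" prefix when no keyword is given.
    prefix = search_keyword if search_keyword else "profiles"
    filename = f"{prefix}_page_{page_num}.json"
    return os.path.join(folder, filename) if folder else filename


def page_already_saved(page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> bool:
    """Read-only resume check: reports whether the page file exists, creating nothing."""
    return os.path.exists(page_filepath(page_num, search_keyword, folder))


def save_page(data: List[Dict], page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> str:
    """Write one page of profiles as UTF-8 JSON and return the path written."""
    filepath = page_filepath(page_num, search_keyword, folder)
    if folder:
        # The output folder is created only when something is actually written.
        os.makedirs(folder, exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return filepath


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "out")
        print(page_already_saved(1, "python", out_dir))  # nothing saved yet
        save_page([{"name": "example"}], 1, "python", out_dir)
        print(page_already_saved(1, "python", out_dir))  # file now exists
```

Returning a `bool` from the check, rather than the `1`/`0` integers used by `_check_json_list_length`, is a design choice of this sketch: it lets callers write `if page_already_saved(...)` directly instead of comparing against a sentinel value.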