search-talent 0.1.0__tar.gz

@@ -0,0 +1,195 @@
+ Metadata-Version: 2.4
+ Name: search-talent
+ Version: 0.1.0
+ Summary: LinkedIn Recruiter search scraper and JSON-to-Excel tool
+ Author-email: Chandler <275737875@qq.com>
+ License: MIT
+ Keywords: linkedin,recruiter,scraper,search,talent
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
+ Classifier: Topic :: Office/Business
+ Requires-Python: >=3.9
+ Requires-Dist: beautifulsoup4>=4.12.0
+ Requires-Dist: browser-dog>=0.1.0
+ Requires-Dist: openpyxl>=3.0.0
+ Requires-Dist: pandas>=1.5.0
+ Requires-Dist: selenium>=4.0.0
+ Requires-Dist: typer>=0.9.0
+ Description-Content-Type: text/markdown
+
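+ # search-talent
+
+ [![PyPI version](https://img.shields.io/pypi/v/search-talent.svg)](https://pypi.org/project/search-talent/)
+ [![Python](https://img.shields.io/pypi/pyversions/search-talent.svg)](https://pypi.org/project/search-talent/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ A Selenium-based scraper for LinkedIn Recruiter search results that collects candidate information and exports it to an Excel file.
+
+ ## Features
+
+ - **Automated scraping**: Simulates human-like scrolling to load the candidates on each LinkedIn Recruiter search results page.
+ - **Structured data extraction**: Parses each candidate's name, headline, location, work experience, education, industry, connection degree, and other fields.
+ - **Resumable runs**: Pages whose JSON file has already been saved are skipped, so an interrupted run can continue where it stopped.
+ - **Excel export**: Consolidates the JSON data into a single `.xlsx` file and splits CamelCase-joined words so the text is easier to read.
+ - **CLI-driven**: After installation, run the `search-talent` command directly from a terminal; no extra scripts are needed.
+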
+ ## Installation
+
+ ### Install with pip (recommended)
+
+ ```bash
+ pip install search-talent
+ ```
+
+ ### Install from source
+
+ ```bash
+ git clone https://github.com/your-username/search-talent.git
+ cd search-talent
+ pip install .
+ ```
+
+ ### Dependencies
+
+ The following dependencies are installed automatically:
+
+ | Dependency | Purpose |
+ |------------|---------|
+ | `typer` | CLI framework |
+ | `selenium` | Browser automation |
+ | `beautifulsoup4` | HTML parsing |
+ | `pandas` | Data processing |
+ | `openpyxl` | Excel file I/O (pandas backend) |
+ | `browser-dog` | LinkedIn login and cookie management |
+
+ > **Note**: Running the scraper requires a local Chrome/Chromium installation and a ChromeDriver of the matching version.
+
+ ## Prerequisites
+
+ ### Cookie file
+
+ The tool logs in to LinkedIn Recruiter by injecting cookies, so you need to prepare a `cookie.json` file in advance.
+
+ The file contains a JSON array in which each entry is a cookie object:
+
+ ```json
+ [
+   {
+     "domain": ".linkedin.com",
+     "name": "li_at",
+     "value": "YOUR_COOKIE_VALUE",
+     "path": "/",
+     "secure": true,
+     "httpOnly": true
+   }
+ ]
+ ```
+
+ > **Tip**: A browser extension such as EditThisCookie can export the cookies of a logged-in LinkedIn session.
+
+ ## CLI guide
+
+ After installation, the `search-talent` command is available directly in your terminal.
+
+ ### Show help
+
+ ```bash
+ search-talent --help
+ ```
+
+ ### `scrape` command - scrape LinkedIn data
+
+ Scrapes candidate information from a LinkedIn Recruiter search results page and saves it as JSON files.
+
+ ```bash
+ search-talent scrape \
+   --start-url "https://www.linkedin.com/talent/search?q=..." \
+   --search-keyword "python-developer" \
+   --folder ./output
+ ```
+
+ **Options:**
+
+ | Option | Short | Description | Default |
+ |--------|-------|-------------|---------|
+ | `--start-url` | `-s` | LinkedIn Recruiter search results URL (required) | - |
+ | `--search-keyword` | `-k` | Search keyword, used to name the JSON files (required) | - |
+ | `--folder` | `-f` | Local directory in which to save the JSON files (required) | - |
+ | `--cookie` | `-c` | Path to the cookie JSON file | `cookie.json` |
+ | `--scroll-times` | `-n` | Number of times to scroll the page | `3` |
+ | `--max-pages` | `-m` | Maximum number of pages to scrape (unlimited if omitted) | `None` |
+ | `--headless` | - | Run the browser in headless mode | `False` |
+
+ **Examples:**
+
+ ```bash
+ # Scrape the first 5 pages using a custom cookie file
+ search-talent scrape \
+   -s "https://www.linkedin.com/talent/search?q=engineer&location=China" \
+   -k "engineer-china" \
+   -f ./data/engineer-china \
+   -c ~/linkedin_cookie.json \
+   -m 5
+
+ # Run in headless mode
+ search-talent scrape \
+   -s "https://www.linkedin.com/talent/search?q=..." \
+   -k "my-search" \
+   -f ./data \
+   --headless
+ ```
+
+ ### `export` command - export to Excel
+
+ Merges the JSON files produced by the `scrape` command and exports them as a single Excel file.
+
+ ```bash
+ search-talent export \
+   --input ./output \
+   --output candidates.xlsx
+ ```
+
+ **Options:**
+
+ | Option | Short | Description | Default |
+ |--------|-------|-------------|---------|
+ | `--input` | `-i` | Input directory containing the JSON files (required) | - |
+ | `--output` | `-o` | Path of the Excel file to write (required) | - |
+
+ **Example:**
+
+ ```bash
+ search-talent export -i ./data/engineer-china -o engineer-china.xlsx
+ ```
+
+ ## Project structure
+
+ ```
+ search-talent/
+ ├── pyproject.toml           # Project configuration and build definition
+ ├── README.md
+ └── src/
+     └── search_talent/
+         ├── __init__.py      # Package entry point, version info
+         ├── scraper.py       # Core LinkedIn scraper module
+         ├── reader.py        # JSON reading and Excel export module
+         └── cli.py           # Typer CLI entry point
+ ```
+
+ ## Notes
+
+ - **Compliance**: Follow LinkedIn's terms of service, and keep the scraping frequency and volume moderate to avoid having your account banned.
+ - **Cookie expiry**: LinkedIn cookies expire; if login fails, export the cookie file again.
+ - **Anti-scraping measures**: The tool includes random waits and human-like scrolling to reduce detection, but you should still limit the scraping speed.
+ - **ChromeDriver version**: Make sure the local ChromeDriver version matches your Chrome browser version.
+
+ ## License
+
+ [MIT License](https://opensource.org/licenses/MIT)
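Reviewer note: the README describes `cookie.json` as a JSON array of cookie objects. A minimal pre-flight check could look like the sketch below. It is not part of the package: `validate_cookies`, `validate_cookie_file`, and the `REQUIRED_KEYS` set are hypothetical names, and the required keys are only an assumption taken from the README example, since this diff does not show which fields `browser-dog` actually needs.

```python
import json
from typing import Any, List

# Assumption: these keys mirror the README example; browser-dog may need fewer or more.
REQUIRED_KEYS = {"domain", "name", "value"}


def validate_cookies(cookies: Any) -> List[str]:
    """Return a list of problems; an empty list means the data looks usable."""
    if not isinstance(cookies, list):
        return ["top-level JSON value must be an array"]
    problems = []
    for index, cookie in enumerate(cookies):
        if not isinstance(cookie, dict):
            problems.append(f"entry {index} is not an object")
            continue
        missing = sorted(REQUIRED_KEYS - set(cookie))
        if missing:
            problems.append(f"entry {index} is missing: {', '.join(missing)}")
    return problems


def validate_cookie_file(path: str) -> List[str]:
    """Load a cookie file from disk and validate its contents."""
    with open(path, "r", encoding="utf-8") as f:
        return validate_cookies(json.load(f))


if __name__ == "__main__":
    sample = [{"domain": ".linkedin.com", "name": "li_at", "value": "YOUR_COOKIE_VALUE"}]
    print(validate_cookies(sample))
    print(validate_cookies([{"name": "li_at"}]))
```

Running a check like this before `search-talent scrape` would surface a malformed file early, instead of as a login failure inside the browser session.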
@@ -0,0 +1,42 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["src/search_talent"]
+
+ [project]
+ name = "search-talent"
+ version = "0.1.0"
+ description = "LinkedIn Recruiter search scraper and JSON-to-Excel tool"
+ readme = "README.md"
+ requires-python = ">=3.9"
+ license = { text = "MIT" }
+ authors = [
+     { name = "Chandler", email = "275737875@qq.com" },
+ ]
+ keywords = ["linkedin", "recruiter", "scraper", "talent", "search"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "License :: OSI Approved :: MIT License",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.9",
+     "Programming Language :: Python :: 3.10",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+     "Topic :: Internet :: WWW/HTTP :: Browsers",
+     "Topic :: Office/Business",
+ ]
+ dependencies = [
+     "typer>=0.9.0",
+     "selenium>=4.0.0",
+     "beautifulsoup4>=4.12.0",
+     "pandas>=1.5.0",
+     "openpyxl>=3.0.0",
+     "browser-dog>=0.1.0",
+ ]
+
+ [project.scripts]
+ search-talent = "search_talent.cli:app"
+
@@ -0,0 +1,3 @@
+ """search-talent: LinkedIn Recruiter search scraper and JSON-to-Excel tool."""
+
+ __version__ = "0.1.0"
@@ -0,0 +1,69 @@
+ """Talent Search CLI - a Typer-based command-line tool."""
+
+ from typing import Optional
+
+ import typer
+
+ from .scraper import (
+     LinkedInProfileParser,
+     LinkedInPageParser,
+     SeleniumPageNavigator,
+     JsonDataSaver,
+     LinkedInScraper,
+     login,
+ )
+ from .reader import read_json_files_and_save_to_excel
+
+ app = typer.Typer(
+     # Matches the console script name declared in pyproject.toml.
+     name="search-talent",
+     help="LinkedIn Recruiter search scraper and JSON-to-Excel tool",
+     add_completion=False,
+ )
+
+
+ @app.command()
+ def scrape(
+     start_url: str = typer.Option(..., "--start-url", "-s", help="LinkedIn search URL"),
+     search_keyword: str = typer.Option(..., "--search-keyword", "-k", help="Search keyword for file naming"),
+     folder: str = typer.Option(..., "--folder", "-f", help="Local folder to save scraped data"),
+     cookie_file: str = typer.Option("cookie.json", "--cookie", "-c", help="Cookie JSON file path"),
+     scroll_times: int = typer.Option(3, "--scroll-times", "-n", help="Number of times to scroll the page"),
+     max_pages: Optional[int] = typer.Option(None, "--max-pages", "-m", help="Maximum pages to scrape (None=unlimited)"),
+     headless: bool = typer.Option(False, "--headless", help="Run browser in headless mode"),
+ ):
+     """Scrape LinkedIn Recruiter search results and save as JSON files."""
+     profile_parser = LinkedInProfileParser()
+     page_parser = LinkedInPageParser(profile_parser)
+     navigator = SeleniumPageNavigator()
+     data_saver = JsonDataSaver()
+
+     driver = login(cookie_file=cookie_file, headless=headless)
+
+     scraper = LinkedInScraper(driver, page_parser, navigator, data_saver)
+     try:
+         scraper.scrape(
+             start_url=start_url,
+             search_keyword=search_keyword,
+             folder=folder,
+             scroll_times=scroll_times,
+             max_pages=max_pages,
+         )
+     finally:
+         # Close the browser even if scraping fails part-way through.
+         driver.quit()
+
+     typer.echo(f"Scraping completed. Data saved to: {folder}")
+
+
+ @app.command()
+ def export(
+     # Parameter names avoid shadowing the built-in input(); the CLI flags are unchanged.
+     input_dir: str = typer.Option(..., "--input", "-i", help="Input folder containing JSON files"),
+     output_path: str = typer.Option(..., "--output", "-o", help="Output Excel file path (e.g. output.xlsx)"),
+ ):
+     """Read JSON files from folder and export to Excel."""
+     try:
+         read_json_files_and_save_to_excel(input_dir, output_path)
+         typer.echo(f"Export completed: {output_path}")
+     except Exception as e:
+         typer.echo(f"Error: {e}", err=True)
+         raise typer.Exit(code=1)
+
+
+ if __name__ == "__main__":
+     app()
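Reviewer note: `--max-pages` is forwarded unchanged to `LinkedInScraper.scrape`, whose loop processes page 1 and then clicks Next while `page_num < max_pages`. The sketch below replays that loop condition with a stand-in for the navigator, to show which pages end up visited. `pages_visited` and `available_pages` are hypothetical names used only for illustration; no browser is involved.

```python
from typing import List, Optional


def pages_visited(available_pages: int, max_pages: Optional[int]) -> List[int]:
    """Replay the scrape loop: process page 1, then advance while allowed."""
    page_num = 1
    visited = [page_num]

    def click_next() -> bool:
        # Stand-in for SeleniumPageNavigator.click_next: succeeds until the last page.
        return page_num < available_pages

    while (max_pages is None or page_num < max_pages) and click_next():
        page_num += 1
        visited.append(page_num)
    return visited


if __name__ == "__main__":
    print(pages_visited(available_pages=10, max_pages=3))
    print(pages_visited(available_pages=2, max_pages=5))
    print(pages_visited(available_pages=4, max_pages=None))
```

One consequence of this loop shape is that page 1 is always processed, even when `max_pages` is 1 or lower, because the limit is only checked before each Next click.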
@@ -0,0 +1,101 @@
+ """JSON reading and Excel export module."""
+
+ import json
+ import os
+ import re
+
+ import pandas as pd
+
+
+ def split_camel_case(word: str) -> str:
+     """Split a CamelCase string into space-separated words.
+
+     Example: 'DigitalSalesStrategy' -> 'Digital Sales Strategy'
+     """
+     return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", word)
+
+
+ def split_compound_words(text: str) -> str:
+     """Find CamelCase compounds in the text and split them into spaced words; leave the rest unchanged."""
+     pattern = r"[A-Z][a-z]+(?:[A-Z][a-z]+)+"
+
+     def replacer(match):
+         return split_camel_case(match.group(0))
+
+     return re.sub(pattern, replacer, text)
+
+
+ def col_format(exps: list) -> str:
+     """Format an experiences / education list as a readable multi-line string."""
+     exp_list = []
+     for exp in exps:
+         # The scraper stores each entry's text under "title"; "title_company"
+         # is kept as a fallback for JSON files that use that key.
+         text = exp.get("title_company") or exp.get("title") or ""
+         if "\u00b7" in text:
+             # "Title·Company" -> "Title (Company)"
+             text = text.replace("\u00b7", " (", 1) + ")"
+         exp_list.append(split_compound_words(text))
+     return "\n".join(exp_list)
+
+
+ def read_json_files_and_save_to_excel(folder_path: str, output_excel_path: str) -> None:
+     """Read every JSON file in a folder (each holding a list) and save the rows to an Excel file.
+
+     Args:
+         folder_path: Folder containing the JSON files.
+         output_excel_path: Path of the Excel file to write.
+     """
+     if not os.path.exists(folder_path):
+         raise FileNotFoundError(f"Folder does not exist: {folder_path}")
+
+     all_data = []
+
+     for filename in os.listdir(folder_path):
+         file_path = os.path.join(folder_path, filename)
+
+         if not (os.path.isfile(file_path) and filename.lower().endswith(".json")):
+             continue
+
+         try:
+             with open(file_path, "r", encoding="utf-8") as f:
+                 data = json.load(f)
+
+             if not isinstance(data, list):
+                 print(f"Skipped: {filename}, content is not a list.")
+                 continue
+
+             for item in data:
+                 if isinstance(item, dict):
+                     item["source_file"] = filename
+                 else:
+                     item = {"value": item, "source_file": filename}
+
+                 # .get() keeps a record with a missing or null field from
+                 # aborting the rest of the file.
+                 item["experiences"] = col_format(item.get("experiences") or [])
+                 item["education"] = col_format(item.get("education") or [])
+                 item["headline"] = split_compound_words(item.get("headline") or "")
+                 all_data.append(item)
+
+             print(f"Read: {filename} -> {len(data)} items")
+
+         except Exception as e:
+             print(f"Failed to read: {filename}, error: {e}")
+
+     if all_data:
+         df = pd.DataFrame(all_data)
+         columns_order = [
+             "name",
+             "headline",
+             "location",
+             "experiences",
+             "education",
+             "profile_url",
+             "industry",
+             "source_file",
+         ]
+         # reindex() leaves a missing column empty instead of raising KeyError.
+         df = df.reindex(columns=columns_order)
+         df.to_excel(output_excel_path, index=False)
+         print(f"\nAll data saved to: {output_excel_path}")
+     else:
+         print("No valid data to save.")
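Reviewer note: the two regular expressions in `reader.py` decide which scraped strings get spaces inserted, so it helps to see them on a few inputs. The sketch below copies only those two helpers (standard library only) so it runs without pandas. The sample strings are made up for illustration and are not taken from real scraped data.

```python
import re


def split_camel_case(word: str) -> str:
    # Insert a space at lower->Upper boundaries and before the last capital
    # of an acronym that is followed by a lowercase letter.
    return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", word)


def split_compound_words(text: str) -> str:
    # Only runs of two or more Capitalized chunks are rewritten.
    return re.sub(
        r"[A-Z][a-z]+(?:[A-Z][a-z]+)+",
        lambda match: split_camel_case(match.group(0)),
        text,
    )


if __name__ == "__main__":
    for sample in ("DigitalSalesStrategy", "SeniorEngineer at OpenAI", "McDonald"):
        print(repr(sample), "->", repr(split_compound_words(sample)))
```

The pattern leaves tokens such as `OpenAI` alone because the trailing capitals are not followed by lowercase letters, but it also splits surnames like `McDonald`, which may be worth keeping in mind when reading the exported headline column.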
@@ -0,0 +1,386 @@
+ """LinkedIn Recruiter search scraper module."""
+
+ from abc import ABC, abstractmethod
+ import json
+ import os
+ import random
+ import re
+ import time
+ from typing import Dict, List, Optional
+
+ from bs4 import BeautifulSoup
+ from selenium.webdriver.remote.webdriver import WebDriver
+ from selenium.webdriver.support.ui import WebDriverWait
+ from selenium.webdriver.support import expected_conditions as EC
+ from selenium.common.exceptions import TimeoutException
+ from selenium.webdriver.common.by import By
+
+ from browser_dog.browser import BrowserDog
+
+
+ # ===================== Login =====================
+
+ def login(cookie_file: str = "cookie.json", headless: bool = False) -> WebDriver:
+     """Log in to LinkedIn Talent using BrowserDog."""
+     dog = BrowserDog(
+         cookies_json=cookie_file,
+         base_url="https://www.linkedin.com/talent/home",
+         headless=headless,
+     )
+
+     if dog.detect_block_or_captcha():
+         raise RuntimeError("Login failed or blocked!")
+
+     driver = dog.get_driver()
+     driver.maximize_window()
+     print("Opening the LinkedIn Talent home page...")
+     driver.get("https://www.linkedin.com/talent/home")
+     dog.long_wait()
+     return driver
+
+
+ # ===================== 1. Parser interface =====================
+
+ class ProfileParser(ABC):
+     """Abstract interface for profile parsers."""
+
+     @abstractmethod
+     def parse(self, soup: BeautifulSoup) -> Dict:
+         """Parse a single profile card."""
+         pass
+
+
+ class LinkedInProfileParser(ProfileParser):
+     """Parser for LinkedIn profile cards."""
+
+     def parse(self, soup: BeautifulSoup) -> Dict:
+         data = {
+             "name": self._extract_name(soup),
+             "profile_url": self._extract_profile_url(soup),
+             "avatar_url": self._extract_avatar_url(soup),
+             "headline": self._extract_headline(soup),
+             "location": self._extract_location(soup),
+             "industry": self._extract_industry(soup),
+             "connection_degree": self._extract_connection_degree(soup),
+             "experiences": self._extract_experiences(soup),
+             "education": self._extract_education(soup),
+             "decorations": self._extract_decorations(soup),
+         }
+         return data
+
+     def _extract_name(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one('[data-test-row-lockup-full-name] a')
+         return elem.get_text(strip=True) if elem else None
+
+     def _extract_profile_url(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one('[data-test-row-lockup-full-name] a')
+         return elem.get("href") if elem else None
+
+     def _extract_avatar_url(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one("[data-test-lockup-image]")
+         return elem.get("src") if elem else None
+
+     def _extract_headline(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one("[data-test-row-lockup-headline]")
+         return elem.get_text(strip=True) if elem else None
+
+     def _extract_location(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one("[data-test-row-lockup-location]")
+         return elem.get_text(strip=True) if elem else None
+
+     def _extract_industry(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one("[data-test-current-employer-industry]")
+         return elem.get_text(strip=True) if elem else None
+
+     def _extract_connection_degree(self, soup: BeautifulSoup) -> Optional[str]:
+         elem = soup.select_one('.artdeco-entity-lockup__badge span[aria-hidden="true"]')
+         if elem:
+             text = elem.get_text(strip=True)
+             # Drop the leading middle dot that precedes the degree text.
+             return re.sub(r"^·\s*", "", text).strip()
+         badge = soup.select_one(".artdeco-entity-lockup__badge")
+         if badge:
+             text = badge.get_text(strip=True)
+             match = re.search(r"(\d+(?:st|nd|rd|th))", text)
+             return match.group(1) if match else text
+         return None
+
+     def _extract_experiences(self, soup: BeautifulSoup) -> List[Dict]:
+         return self._extract_history_group(soup, "Experience")
+
+     def _extract_education(self, soup: BeautifulSoup) -> List[Dict]:
+         return self._extract_history_group(soup, "Education")
+
+     def _extract_history_group(self, soup: BeautifulSoup, group_title: str) -> List[Dict]:
+         groups = soup.select(".history-group")
+         for group in groups:
+             title_elem = group.select_one(".history-group__term")
+             if title_elem and group_title in title_elem.get_text():
+                 return [self._parse_history_item(li) for li in group.select(".history-group__list-items li")]
+         return []
+
+     def _parse_history_item(self, li: BeautifulSoup) -> Dict:
+         main_text = li.get_text(strip=True)
+         date_elem = li.select_one(".description-entry__date-duration")
+         date_range = None
+         if date_elem:
+             times = date_elem.find_all("time")
+             if len(times) == 2:
+                 date_range = f"{times[0].get_text(strip=True)} – {times[1].get_text(strip=True)}"
+             elif len(times) == 1:
+                 date_range = times[0].get_text(strip=True)
+         return {"title": main_text, "date_range": date_range}
+
+     def _extract_decorations(self, soup: BeautifulSoup) -> Dict:
+         decorations: Dict[str, List[str]] = {"interest": [], "activity": []}
+         for row in soup.select(".decorations__row"):
+             label = row.select_one(".decorations__row-title")
+             if not label:
+                 continue
+             label_text = label.get_text(strip=True).lower()
+             triggers = row.select(".base-decoration__trigger-text")
+             if label_text == "interest":
+                 decorations["interest"] = [t.get_text(strip=True) for t in triggers]
+             elif label_text == "activity":
+                 decorations["activity"] = [t.get_text(strip=True) for t in triggers]
+         return decorations
+
+
+ # ===================== 2. Page parser =====================
+
+ class PageParser(ABC):
+     """Abstract interface for page parsers."""
+
+     @abstractmethod
+     def parse_profiles(self, page_source: str) -> List[Dict]:
+         """Parse every profile found in the page source."""
+         pass
+
+
+ class LinkedInPageParser(PageParser):
+     """Parser for LinkedIn search results pages."""
+
+     def __init__(self, profile_parser: ProfileParser):
+         self.profile_parser = profile_parser
+
+     def parse_profiles(self, page_source: str) -> List[Dict]:
+         soup = BeautifulSoup(page_source, "html.parser")
+         profile_items = soup.select("li.profile-list__border-bottom")
+         return [self.profile_parser.parse(item) for item in profile_items]
+
+
+ # ===================== 3. Page navigation interface =====================
+
+ class PageNavigator(ABC):
+     """Abstract interface for page navigation."""
+
+     @abstractmethod
+     def scroll_and_wait(self, driver: WebDriver, **kwargs) -> None:
+         """Scroll the page and wait."""
+         pass
+
+     @abstractmethod
+     def click_next(self, driver: WebDriver, timeout: int = 10) -> bool:
+         """Click the next-page button and return whether it succeeded."""
+         pass
+
+
+ class SeleniumPageNavigator(PageNavigator):
+     """Selenium-based page navigation implementation."""
+
+     def scroll_and_wait(
+         self,
+         driver: WebDriver,
+         max_scroll_rounds: int = 30,
+         height_stable_count: int = 3,
+         min_wait: float = 1.0,
+         max_wait: float = 3.0,
+         scroll_step_ratio: tuple = (0.3, 0.8),
+     ) -> None:
+         """Scroll step by step until the page height stops growing."""
+         print("Starting adaptive scroll loading (detecting new content dynamically)...")
+
+         viewport_height = driver.execute_script("return window.innerHeight")
+
+         def get_total_height() -> int:
+             return driver.execute_script("return document.body.scrollHeight")
+
+         def get_scroll_y() -> int:
+             return driver.execute_script("return window.scrollY")
+
+         def scroll_to(y: float) -> None:
+             driver.execute_script(f"window.scrollTo(0, {y});")
+
+         def scroll_by(delta: float) -> None:
+             driver.execute_script(f"window.scrollBy(0, {delta});")
+
+         def random_wait() -> None:
+             time.sleep(random.uniform(min_wait, max_wait))
+
+         last_height = get_total_height()
+         stable_rounds = 0
+         current_round = 0
+         current_pos = get_scroll_y()
+
+         while current_round < max_scroll_rounds and stable_rounds < height_stable_count:
+             # Advance by a random fraction of the viewport to look less mechanical.
+             step = random.uniform(*scroll_step_ratio) * viewport_height
+             target_y = min(current_pos + step, last_height)
+
+             # Snap to the bottom when the remaining distance is small.
+             if last_height - target_y < viewport_height * 0.2:
+                 target_y = last_height
+
+             scroll_to(target_y)
+             random_wait()
+
+             current_pos = get_scroll_y()
+             new_height = get_total_height()
+
+             if new_height > last_height:
+                 print(f"Round {current_round + 1}: height grew from {last_height} to {new_height}")
+                 last_height = new_height
+                 stable_rounds = 0
+                 # Occasionally scroll back a little and forward again.
+                 if random.random() < 0.3:
+                     back_distance = random.uniform(0.1, 0.3) * viewport_height
+                     scroll_by(-back_distance)
+                     random_wait()
+                     scroll_by(back_distance)
+                     random_wait()
+             else:
+                 stable_rounds += 1
+                 print(f"Round {current_round + 1}: height unchanged ({new_height}), stable count {stable_rounds}")
+
+             if (last_height - current_pos) < viewport_height * 0.1 and stable_rounds < height_stable_count:
+                 print("Near the bottom with no height change; forcing a scroll to the bottom and waiting...")
+                 scroll_to(last_height)
+                 random_wait()
+                 new_height = get_total_height()
+                 if new_height > last_height:
+                     last_height = new_height
+                     stable_rounds = 0
+                 current_pos = get_scroll_y()
+
+             current_round += 1
+
+         scroll_to(get_total_height())
+         time.sleep(random.uniform(1.5, 3.0))
+
+         if random.random() < 0.5:
+             print("Final pass: scrolling back to the middle, then to the bottom again...")
+             scroll_to(get_total_height() // 2)
+             time.sleep(random.uniform(0.5, 1.5))
+             scroll_to(get_total_height())
+             time.sleep(random.uniform(1.0, 2.0))
+
+         print("Adaptive scroll loading finished; attempted to load all page content.")
+
+     def click_next(self, driver: WebDriver, timeout: int = 10) -> bool:
+         try:
+             next_btn = WebDriverWait(driver, timeout).until(
+                 EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-test-pagination-next]"))
+             )
+             if next_btn.get_attribute("aria-disabled") == "true" or next_btn.get_attribute("disabled") is not None:
+                 print("Next button is disabled; no more pages.")
+                 return False
+             driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", next_btn)
+             time.sleep(0.5)
+             next_btn.click()
+             time.sleep(5)
+             print("Clicked the Next button")
+             time.sleep(3)
+             return True
+         except TimeoutException:
+             print("Next button not found or not clickable; reached the last page.")
+             return False
+         except Exception as e:
+             print(f"Error while clicking the Next button: {e}")
+             return False
+
+
+ # ===================== 4. Data persistence interface =====================
+
+ class DataSaver(ABC):
+     """Abstract interface for saving data."""
+
+     @abstractmethod
+     def save(self, data: List[Dict], page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> None:
+         pass
+
+
+ class JsonDataSaver(DataSaver):
+     """Saves each page of profiles as a JSON file."""
+
+     def save(self, data: List[Dict], page_num: int, search_keyword: str = "", folder: Optional[str] = None) -> None:
+         filename = f"{search_keyword}_page_{page_num}.json" if search_keyword else f"profiles_page_{page_num}.json"
+         if folder:
+             os.makedirs(folder, exist_ok=True)
+             filepath = os.path.join(folder, filename)
+         else:
+             filepath = filename
+         with open(filepath, "w", encoding="utf-8") as f:
+             json.dump(data, f, ensure_ascii=False, indent=2)
+         print(f"Saved {len(data)} profiles to {filepath}")
+
+
+ # ===================== 5. Scraper core =====================
+
+ class LinkedInScraper:
+     """Scraper for LinkedIn search results; coordinates the components to run a scrape."""
+
+     def __init__(
+         self,
+         driver: WebDriver,
+         page_parser: PageParser,
+         navigator: PageNavigator,
+         data_saver: DataSaver,
+     ):
+         self.driver = driver
+         self.page_parser = page_parser
+         self.navigator = navigator
+         self.data_saver = data_saver
+
+     def _check_json_list_length(self, page_num: int, search_keyword: str, folder: Optional[str]) -> int:
+         """Return 1 if this page's JSON file already exists, otherwise 0."""
+         filename = f"{search_keyword}_page_{page_num}.json" if search_keyword else f"profiles_page_{page_num}.json"
+         if folder:
+             os.makedirs(folder, exist_ok=True)
+             filepath = os.path.join(folder, filename)
+         else:
+             filepath = filename
+
+         if os.path.exists(filepath):
+             print(f"File {filepath} exists, skipping...")
+             return 1
+         else:
+             print(f"File {filepath} does not exist, returning length 0.")
+             return 0
+
+     def scrape(
+         self,
+         start_url: str,
+         search_keyword: str = "",
+         folder: Optional[str] = None,
+         scroll_times: int = 100,
+         max_pages: Optional[int] = None,
+     ) -> None:
+         """Run the scrape.
+
+         Note: ``scroll_times`` is accepted but is not forwarded to the
+         navigator, which uses its own ``max_scroll_rounds`` default.
+         """
+         page_num = 1
+         self.driver.get(start_url)
+         time.sleep(10)
+
+         # _process_current_page() checks for an existing file and scrolls the
+         # page itself, so page 1 needs no separate scroll beforehand.
+         self._process_current_page(page_num, search_keyword, folder)
+
+         while (max_pages is None or page_num < max_pages) and self.navigator.click_next(self.driver):
+             page_num += 1
+             self._process_current_page(page_num, search_keyword, folder)
+
+     def _process_current_page(self, page_num: int, search_keyword: str, folder: Optional[str]) -> None:
+         # Skip pages that were saved by a previous run.
+         json_list_length = self._check_json_list_length(page_num, search_keyword, folder)
+         if json_list_length == 1:
+             return
+
+         self.navigator.scroll_and_wait(self.driver)
+         page_source = self.driver.page_source
+         profiles = self.page_parser.parse_profiles(page_source)
+         self.data_saver.save(profiles, page_num, search_keyword, folder)
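Reviewer note: resuming relies only on whether a page's JSON file exists, using the `<keyword>_page_<n>.json` naming shared by `JsonDataSaver.save` and `_check_json_list_length`. The sketch below reproduces that check in isolation with the standard library. `page_filename` and `pending_pages` are hypothetical helpers written for this note, not functions from the package.

```python
import json
import os
import tempfile
from typing import List


def page_filename(search_keyword: str, page_num: int) -> str:
    """Mirror the file naming used by JsonDataSaver in this diff."""
    prefix = search_keyword if search_keyword else "profiles"
    return f"{prefix}_page_{page_num}.json"


def pending_pages(folder: str, search_keyword: str, last_page: int) -> List[int]:
    """Return the page numbers in 1..last_page that have no saved file yet."""
    return [
        n
        for n in range(1, last_page + 1)
        if not os.path.exists(os.path.join(folder, page_filename(search_keyword, n)))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Pretend pages 1 and 2 were saved by an earlier, interrupted run.
        for n in (1, 2):
            path = os.path.join(folder, page_filename("engineer-china", n))
            with open(path, "w", encoding="utf-8") as f:
                json.dump([], f)
        print(pending_pages(folder, "engineer-china", 4))
```

Because only existence is tested, a page that was saved with an empty list (for example after a parsing miss) is also treated as done on the next run; deleting that file is the way to force a re-scrape of the page.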