skysearch 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- skysearch-0.2.0/PKG-INFO +162 -0
- skysearch-0.2.0/README.md +138 -0
- skysearch-0.2.0/pyproject.toml +42 -0
- skysearch-0.2.0/setup.cfg +4 -0
- skysearch-0.2.0/src/skysearch/__init__.py +40 -0
- skysearch-0.2.0/src/skysearch/api.py +279 -0
- skysearch-0.2.0/src/skysearch/cli.py +129 -0
- skysearch-0.2.0/src/skysearch/fetcher/__init__.py +17 -0
- skysearch-0.2.0/src/skysearch/fetcher/core.py +77 -0
- skysearch-0.2.0/src/skysearch/fetcher/parser.py +64 -0
- skysearch-0.2.0/src/skysearch/fetcher/session.py +95 -0
- skysearch-0.2.0/src/skysearch/ranker.py +29 -0
- skysearch-0.2.0/src/skysearch/search.py +67 -0
- skysearch-0.2.0/src/skysearch.egg-info/PKG-INFO +162 -0
- skysearch-0.2.0/src/skysearch.egg-info/SOURCES.txt +17 -0
- skysearch-0.2.0/src/skysearch.egg-info/dependency_links.txt +1 -0
- skysearch-0.2.0/src/skysearch.egg-info/entry_points.txt +2 -0
- skysearch-0.2.0/src/skysearch.egg-info/requires.txt +6 -0
- skysearch-0.2.0/src/skysearch.egg-info/top_level.txt +1 -0
skysearch-0.2.0/PKG-INFO
ADDED
@@ -0,0 +1,162 @@
+Metadata-Version: 2.4
+Name: skysearch
+Version: 0.2.0
+Summary: Search the web, rank results, fetch any page content.
+Author: zimvir
+License: MIT
+Project-URL: Homepage, https://github.com/zimvir/skysearch
+Project-URL: Documentation, https://github.com/zimvir/skysearch#readme
+Keywords: search,fetch,bing,bm25,web-scraper,crawler
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Requires-Dist: DrissionPage
+Requires-Dist: beautifulsoup4
+Requires-Dist: lxml
+Requires-Dist: readability-lxml
+Requires-Dist: jieba
+Requires-Dist: rank-bm25
+
+# SkySearch
+
+A lightweight search engine built on Bing search and BM25 ranking, with support for fetching dynamic pages.
+
+## Features
+
+- **Bing search**: SessionPage disguises the TLS fingerprint at a low level to bypass anti-scraping checks
+- **Dynamic pages**: parallel fetching across multiple Chromium tabs
+- **BM25 ranking**: jieba Chinese word segmentation + rank-bm25 relevance ranking
+- **Multiple output modes**: `text` / `info` / `raw`
+- **CLI tool**: accepts command-line arguments and can also be used as a Python library
+
+## Installation
+
+```bash
+pip install skysearch
+```
+
+## Command-line usage
+
+### Search mode
+
+```bash
+# Interactive input
+skysearch
+
+# Specify a keyword
+skysearch "deep learning frameworks"
+
+# Set the number of results
+skysearch -n 20 "Python tutorial"
+
+# Keep the browser open
+skysearch -n 20 "keyword" --keep
+```
+
+### URL fetch mode
+
+```bash
+# Default: output plain text
+skysearch --url https://example.com
+
+# Choose the output mode
+skysearch --url https://example.com --mode text # plain text (default)
+skysearch --url https://example.com --mode info # structured info
+skysearch --url https://example.com --mode raw # raw HTML
+
+# Keep the browser open
+skysearch --url https://example.com --keep
+```
+
+## Use as a library
+
+```python
+import skysearch
+
+# Search
+results = skysearch.search("deep learning", num=10)
+# [{'title': ..., 'url': ..., 'score': 12.5, 'snippet': ...}, ...]
+
+# Fetch a URL
+text = skysearch.fetch("https://example.com")
+info = skysearch.fetch("https://example.com", mode='info')
+raw = skysearch.fetch("https://example.com", mode='raw')
+
+# Standalone functions
+links = skysearch.fetch_links("https://example.com")
+info_dict = skysearch.fetch_info("https://example.com")
+raw_dict = skysearch.fetch_raw("https://example.com")
+
+# Search and fetch in one call
+results = skysearch.search_and_fetch("keyword", mode='info')
+```
+
+## API parameters
+
+### search(query, num=10, verbose=False, keep=False, tuple_format=False)
+
+| Parameter | Description | Default |
+|------|------|--------|
+| query | search keyword | - |
+| num | number of results | 10 |
+| verbose | print detailed progress | False |
+| keep | keep the browser open | False |
+| tuple_format | return results as tuples | False |
+
+### fetch(url, mode='text', keep=False, timeout=10, retry=2)
+
+| Parameter | Description | Default |
+|------|------|--------|
+| url | page URL | - |
+| mode | output mode: `text` `info` `raw` | text |
+| keep | keep the browser open | False |
+| timeout | request timeout in seconds | 10 |
+| retry | number of retries | 2 |
+
+### search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False)
+
+Searches and fetches in one call; each item in the returned list contains the result info and the fetched content.
+
+## Output modes
+
+| Mode | Description | Use case |
+|------|------|----------|
+| `text` | plain-text body | human reading |
+| `info` | structured JSON (url, title, text, links, meta) | data analysis / agents |
+| `raw` | raw HTML | in-depth parsing |
+
+## Tech stack
+
+| Module | Technology |
+|------|------|
+| HTTP requests | DrissionPage (SessionPage) |
+| Dynamic rendering | DrissionPage (ChromiumPage) |
+| HTML parsing | BeautifulSoup4 + lxml |
+| Main-content extraction | readability-lxml |
+| Chinese word segmentation | jieba |
+| Ranking algorithm | rank-bm25 (BM25Okapi) |
+
+## Project structure
+
+```
+src/skysearch/
+├── __init__.py # library entry point, exports all APIs
+├── cli.py # command-line entry point
+├── search.py # Bing search
+├── ranker.py # BM25 ranking
+├── api.py # concise API interface
+└── fetcher/ # page-fetching package
+    ├── __init__.py
+    ├── core.py # core functions
+    ├── session.py # session management
+    └── parser.py # HTML parsing
+```
+
+## License
+
+MIT
skysearch-0.2.0/README.md
ADDED
@@ -0,0 +1,138 @@
+# SkySearch
+
+A lightweight search engine built on Bing search and BM25 ranking, with support for fetching dynamic pages.
+
+## Features
+
+- **Bing search**: SessionPage disguises the TLS fingerprint at a low level to bypass anti-scraping checks
+- **Dynamic pages**: parallel fetching across multiple Chromium tabs
+- **BM25 ranking**: jieba Chinese word segmentation + rank-bm25 relevance ranking
+- **Multiple output modes**: `text` / `info` / `raw`
+- **CLI tool**: accepts command-line arguments and can also be used as a Python library
+
+## Installation
+
+```bash
+pip install skysearch
+```
+
+## Command-line usage
+
+### Search mode
+
+```bash
+# Interactive input
+skysearch
+
+# Specify a keyword
+skysearch "deep learning frameworks"
+
+# Set the number of results
+skysearch -n 20 "Python tutorial"
+
+# Keep the browser open
+skysearch -n 20 "keyword" --keep
+```
+
+### URL fetch mode
+
+```bash
+# Default: output plain text
+skysearch --url https://example.com
+
+# Choose the output mode
+skysearch --url https://example.com --mode text # plain text (default)
+skysearch --url https://example.com --mode info # structured info
+skysearch --url https://example.com --mode raw # raw HTML
+
+# Keep the browser open
+skysearch --url https://example.com --keep
+```
+
+## Use as a library
+
+```python
+import skysearch
+
+# Search
+results = skysearch.search("deep learning", num=10)
+# [{'title': ..., 'url': ..., 'score': 12.5, 'snippet': ...}, ...]
+
+# Fetch a URL
+text = skysearch.fetch("https://example.com")
+info = skysearch.fetch("https://example.com", mode='info')
+raw = skysearch.fetch("https://example.com", mode='raw')
+
+# Standalone functions
+links = skysearch.fetch_links("https://example.com")
+info_dict = skysearch.fetch_info("https://example.com")
+raw_dict = skysearch.fetch_raw("https://example.com")
+
+# Search and fetch in one call
+results = skysearch.search_and_fetch("keyword", mode='info')
+```
+
+## API parameters
+
+### search(query, num=10, verbose=False, keep=False, tuple_format=False)
+
+| Parameter | Description | Default |
+|------|------|--------|
+| query | search keyword | - |
+| num | number of results | 10 |
+| verbose | print detailed progress | False |
+| keep | keep the browser open | False |
+| tuple_format | return results as tuples | False |
+
+### fetch(url, mode='text', keep=False, timeout=10, retry=2)
+
+| Parameter | Description | Default |
+|------|------|--------|
+| url | page URL | - |
+| mode | output mode: `text` `info` `raw` | text |
+| keep | keep the browser open | False |
+| timeout | request timeout in seconds | 10 |
+| retry | number of retries | 2 |
+
+### search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False)
+
+Searches and fetches in one call; each item in the returned list contains the result info and the fetched content.
+
+## Output modes
+
+| Mode | Description | Use case |
+|------|------|----------|
+| `text` | plain-text body | human reading |
+| `info` | structured JSON (url, title, text, links, meta) | data analysis / agents |
+| `raw` | raw HTML | in-depth parsing |
+
+## Tech stack
+
+| Module | Technology |
+|------|------|
+| HTTP requests | DrissionPage (SessionPage) |
+| Dynamic rendering | DrissionPage (ChromiumPage) |
+| HTML parsing | BeautifulSoup4 + lxml |
+| Main-content extraction | readability-lxml |
+| Chinese word segmentation | jieba |
+| Ranking algorithm | rank-bm25 (BM25Okapi) |
+
+## Project structure
+
+```
+src/skysearch/
+├── __init__.py # library entry point, exports all APIs
+├── cli.py # command-line entry point
+├── search.py # Bing search
+├── ranker.py # BM25 ranking
+├── api.py # concise API interface
+└── fetcher/ # page-fetching package
+    ├── __init__.py
+    ├── core.py # core functions
+    ├── session.py # session management
+    └── parser.py # HTML parsing
+```
+
+## License
+
+MIT
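The README names BM25 (through rank-bm25's `BM25Okapi`) as the ranking step, but `ranker.py` itself does not appear in this part of the diff. As a rough illustration of what a BM25 score computes, here is a minimal pure-Python sketch. The name `bm25_scores`, the whitespace tokenization, and the parameters `k1=1.5` and `b=0.75` are assumptions made for this example; the package is described as using jieba and rank-bm25, whose exact IDF variant may differ.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score each document against the query with a textbook Okapi BM25 formula.

    Hypothetical helper: it tokenizes on whitespace, whereas the package is
    described as using jieba for word segmentation.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Fall back to 1.0 when every document is empty to avoid dividing by zero.
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (the "+ 1" variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python web search engine",
    "cooking recipes for dinner",
    "search search ranking",
]
print(bm25_scores(docs, "search"))
```

Documents that never mention a query term score zero, and repeated mentions raise the score with diminishing returns, which is why ranking by descending score surfaces the most relevant pages first.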
skysearch-0.2.0/pyproject.toml
ADDED
@@ -0,0 +1,42 @@
+[project]
+name = "skysearch"
+version = "0.2.0"
+description = "Search the web, rank results, fetch any page content."
+readme = "README.md"
+requires-python = ">=3.11"
+license = {text = "MIT"}
+authors = [
+    {name = "zimvir"}
+]
+keywords = ["search", "fetch", "bing", "bm25", "web-scraper", "crawler"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+]
+
+dependencies = [
+    "DrissionPage",
+    "beautifulsoup4",
+    "lxml",
+    "readability-lxml",
+    "jieba",
+    "rank-bm25",
+]
+
+[project.scripts]
+skysearch = "skysearch.cli:main"
+
+[project.urls]
+Homepage = "https://github.com/zimvir/skysearch"
+Documentation = "https://github.com/zimvir/skysearch#readme"
+
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools.packages.find]
+where = ["src"]
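The `[project.scripts]` table maps the `skysearch` command to `skysearch.cli:main`. As a sketch of how such a `module:attribute` reference can be resolved, the hypothetical helper below performs the lookup with `importlib`; installers generate a wrapper script that does something similar. It is demonstrated on a standard-library target so that it runs without the package installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attr' string to the object it names.

    Hypothetical helper for illustration; it is not part of skysearch.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes such as "Class.method"; an empty path returns the module.
    for part in filter(None, attr_path.split(".")):
        obj = getattr(obj, part)
    return obj


# With the package installed, resolve_entry_point("skysearch.cli:main")
# would return the function that the `skysearch` command runs.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```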
skysearch-0.2.0/src/skysearch/__init__.py
ADDED
@@ -0,0 +1,40 @@
+"""SkySearch - a lightweight search engine built on Bing search and BM25 ranking."""
+
+__version__ = "0.1.0"
+
+from .search import bing_search
+from .fetcher import (
+    fetch_page_text,
+    fetch_all_pages,
+    is_dynamic_page,
+    parse_html,
+    parse_info,
+)
+from .ranker import rank_documents
+from . import api
+
+__all__ = [
+    'bing_search',
+    'fetch_page_text',
+    'fetch_all_pages',
+    'is_dynamic_page',
+    'parse_html',
+    'parse_info',
+    'rank_documents',
+    'api',
+    # API functions
+    'search',
+    'fetch',
+    'fetch_info',
+    'fetch_raw',
+    'fetch_links',
+    'search_and_fetch',
+]
+
+# Shortcut references
+search = api.search
+fetch = api.fetch
+fetch_info = api.fetch_info
+fetch_raw = api.fetch_raw
+fetch_links = api.fetch_links
+search_and_fetch = api.search_and_fetch
skysearch-0.2.0/src/skysearch/api.py
ADDED
@@ -0,0 +1,279 @@
+"""Concise API interface that mirrors the CLI's behavior."""
+
+import json
+from .search import bing_search as _bing_search
+from .fetcher import (
+    fetch_page_text as _fetch_page_text,
+    fetch_all_pages as _fetch_all_pages,
+    parse_html as _parse_html,
+    is_dynamic_page as _is_dynamic_page,
+    parse_info as _parse_info,
+)
+from .ranker import rank_documents as _rank_documents
+from DrissionPage import SessionPage, ChromiumOptions, ChromiumPage
+from bs4 import BeautifulSoup
+import time
+
+
+def search(query, num=10, verbose=False, keep=False, tuple_format=False):
+    """
+    Search mode (like the CLI: skysearch "keyword").
+
+    Args:
+        query: search keyword
+        num: number of results (default 10)
+        verbose: whether to print detailed progress (default False)
+        keep: whether to keep the browser open (default False)
+        tuple_format: whether to return tuples (default False)
+            False: [{"title": ..., "url": ..., "score": 12.5}, ...]
+            True: [({"title": ..., "url": ...}, 12.5), ...]
+
+    Returns:
+        the ranked list of results
+    """
+    if verbose:
+        print(f"Searching Bing for: {query}\n")
+
+    results = _bing_search(query, num_results=num)
+    if not results:
+        if verbose:
+            print("No results found.")
+        return []
+
+    if verbose:
+        print(f"Found {len(results)} results, fetching content...\n")
+
+    urls = [r['url'] for r in results]
+    raw_results = _fetch_all_pages(urls, verbose=verbose)
+    raw_results.sort(key=lambda x: x[0])
+
+    texts = []
+    valid_results = []
+
+    for i, (idx, text, is_dynamic) in enumerate(raw_results):
+        if text:
+            texts.append(text)
+            valid_results.append(results[i])
+            if verbose:
+                status = "[D]" if is_dynamic else "[S]"
+                print(f"[{i+1}] {status} {results[i]['title'][:40]} | length: {len(text)}")
+
+    if not texts:
+        if verbose:
+            print("\nNo page content fetched.")
+        return []
+
+    if verbose:
+        print(f"\nRanking... (valid pages: {len(texts)})\n")
+
+    scores = _rank_documents(texts, query)
+
+    if tuple_format:
+        ranked = list(zip(valid_results, scores))
+        ranked.sort(key=lambda x: x[1], reverse=True)
+        return ranked
+
+    # Return a flat list by default
+    output = []
+    for r, score in zip(valid_results, scores):
+        item = dict(r)
+        item['score'] = score
+        output.append(item)
+
+    output.sort(key=lambda x: x['score'], reverse=True)
+    return output
+
+
+def fetch(url, mode='text', keep=False, timeout=10, retry=2):
+    """
+    URL fetch mode (like the CLI: skysearch --url xxx).
+
+    Args:
+        url: page URL
+        mode: output mode, 'text' | 'info' | 'raw' (default 'text')
+        keep: whether to keep the browser open (default False)
+        timeout: request timeout in seconds (default 10)
+        retry: number of retries (default 2)
+
+    Returns:
+        content that depends on the mode
+    """
+    page = SessionPage()
+    page.get(url, retry=retry, timeout=timeout)
+    html = page.html
+
+    # Render dynamic pages in a browser
+    if _is_dynamic_page(html):
+        options = ChromiumOptions()
+        options.headless = True
+        browser = ChromiumPage(options)
+        try:
+            browser.get(url)
+            browser.wait.load_start()
+            time.sleep(2)
+            html = browser.html
+        finally:
+            if not keep:
+                browser.quit()
+
+    if mode == 'raw':
+        return json.dumps({"url": url, "html": html}, ensure_ascii=False)
+
+    if mode == 'info':
+        result = {"url": url}
+        result.update(_parse_html(html, mode='info'))
+        return json.dumps(result, ensure_ascii=False, indent=2)
+
+    if mode == 'text':
+        return _fetch_page_text(url, close_browser=not keep)
+
+    return None
+
+
+def fetch_info(url, keep=False, timeout=10, retry=2):
+    """
+    Get structured info (url, title, text, links, meta).
+
+    Args:
+        url: page URL
+        keep: whether to keep the browser open (default False)
+        timeout: request timeout in seconds (default 10)
+        retry: number of retries (default 2)
+
+    Returns:
+        dict: {"url": ..., "title": ..., "text": ..., "links": [...], "meta": {...}}
+    """
+    page = SessionPage()
+    page.get(url, retry=retry, timeout=timeout)
+    html = page.html
+
+    if _is_dynamic_page(html):
+        options = ChromiumOptions()
+        options.headless = True
+        browser = ChromiumPage(options)
+        try:
+            browser.get(url)
+            browser.wait.load_start()
+            time.sleep(2)
+            html = browser.html
+        finally:
+            if not keep:
+                browser.quit()
+
+    return {"url": url, **_parse_info(html)}
+
+
+def fetch_raw(url, keep=False, timeout=10, retry=2):
+    """
+    Get the raw HTML.
+
+    Args:
+        url: page URL
+        keep: whether to keep the browser open (default False)
+        timeout: request timeout in seconds (default 10)
+        retry: number of retries (default 2)
+
+    Returns:
+        dict: {"url": ..., "html": ...}
+    """
+    page = SessionPage()
+    page.get(url, retry=retry, timeout=timeout)
+    html = page.html
+
+    if _is_dynamic_page(html):
+        options = ChromiumOptions()
+        options.headless = True
+        browser = ChromiumPage(options)
+        try:
+            browser.get(url)
+            browser.wait.load_start()
+            time.sleep(2)
+            html = browser.html
+        finally:
+            if not keep:
+                browser.quit()
+
+    return {"url": url, "html": html}
+
+
+def fetch_links(url, timeout=10, retry=2):
+    """
+    Extract all links on the page.
+
+    Args:
+        url: page URL
+        timeout: request timeout in seconds (default 10)
+        retry: number of retries (default 2)
+
+    Returns:
+        list: [{"text": "link text", "href": "https://..."}, ...]
+    """
+    page = SessionPage()
+    page.get(url, retry=retry, timeout=timeout)
+    html = page.html
+
+    if _is_dynamic_page(html):
+        options = ChromiumOptions()
+        options.headless = True
+        browser = ChromiumPage(options)
+        try:
+            browser.get(url)
+            browser.wait.load_start()
+            time.sleep(2)
+            html = browser.html
+        finally:
+            browser.quit()
+
+    soup = BeautifulSoup(html, 'lxml')
+    links = []
+    for a in soup.find_all('a', href=True):
+        links.append({
+            "text": a.get_text(strip=True),
+            "href": a['href']
+        })
+
+    return links
+
+
+def search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False):
+    """
+    Search and fetch in one call.
+
+    Args:
+        query: search keyword
+        num: number of results (default 10)
+        mode: output mode, 'text' | 'info' | 'raw' (default 'text')
+        verbose: whether to print detailed progress (default False)
+        keep: whether to keep the browser open (default False)
+        tuple_format: whether to return tuples (default False)
+
+    Returns:
+        list: search results; each element contains the result info and the fetched content
+    """
+    results = search(query, num=num, verbose=verbose, keep=keep, tuple_format=tuple_format)
+
+    output = []
+    for item in results:
+        if tuple_format:
+            url = item[0]['url']
+            score = item[1]
+            result_dict = item[0]
+        else:
+            url = item['url']
+            score = item.get('score', 0)
+            result_dict = item
+
+        if mode == 'raw':
+            content = fetch_raw(url, keep=keep)
+        elif mode == 'info':
+            content = fetch_info(url, keep=keep)
+        else:
+            content = fetch(url, mode='text', keep=keep)
+
+        output.append({
+            **result_dict,
+            "score": score,
+            "content": content
+        })
+
+    return output
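`search()` returns one of two shapes depending on `tuple_format`. The sketch below isolates that final step, pairing results with scores and sorting by descending score, so the two shapes can be compared side by side. `attach_scores` is a hypothetical helper name, and the sample results and scores are made up for illustration.

```python
def attach_scores(results, scores, tuple_format=False):
    """Pair search results with relevance scores, highest score first."""
    if tuple_format:
        # Shape: [({"title": ..., "url": ...}, score), ...]
        return sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
    # Shape: [{"title": ..., "url": ..., "score": score}, ...]
    # Each dict is copied so the caller's results are not mutated.
    flat = [{**result, "score": score} for result, score in zip(results, scores)]
    return sorted(flat, key=lambda item: item["score"], reverse=True)


results = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "B", "url": "https://example.com/b"},
]
scores = [1.2, 3.4]

print(attach_scores(results, scores))
print(attach_scores(results, scores, tuple_format=True))
```

Note that `search_and_fetch()` in the diff above unpacks either shape and always emits dicts carrying `score` and `content` keys, so `tuple_format` only changes what `search()` itself returns.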
skysearch-0.2.0/src/skysearch/cli.py
ADDED
@@ -0,0 +1,129 @@
+"""Command-line entry point."""
+
+import argparse
+import time
+import json
+from .search import bing_search
+from .fetcher import fetch_all_pages, fetch_page_text, is_dynamic_page, parse_html
+from .ranker import rank_documents
+from DrissionPage import SessionPage, ChromiumOptions, ChromiumPage
+
+
+def handle_url(args):
+    """Handle --url mode."""
+    print(f"Fetching: {args.url}\n")
+
+    page = SessionPage()
+    page.get(args.url, retry=2, timeout=10)
+    html = page.html
+
+    # Render dynamic pages in a browser
+    if is_dynamic_page(html):
+        options = ChromiumOptions()
+        options.headless = True
+        browser = ChromiumPage(options)
+        try:
+            browser.get(args.url)
+            browser.wait.load_start()
+            time.sleep(2)
+            html = browser.html
+        finally:
+            if not args.keep:
+                browser.quit()
+
+    # Parse and print according to the mode
+    if args.mode == 'raw':
+        result = {"url": args.url, "html": html}
+        print(json.dumps(result, ensure_ascii=False))
+    elif args.mode == 'info':
+        result = {"url": args.url}
+        result.update(parse_html(html, mode='info'))
+        print(json.dumps(result, ensure_ascii=False, indent=2))
+    else:
+        text = fetch_page_text(args.url, close_browser=not args.keep)
+        if args.keep:
+            print("Browser kept open (use Ctrl+C to exit)")
+        elif text:
+            print(f"Content ({len(text)} chars):\n")
+            print(text)
+        else:
+            print("Failed to fetch page content.")
+
+
+def handle_search(args):
+    """Handle search mode."""
+    query = args.query or args.query2
+    if not query:
+        query = input("Enter search query: ")
+
+    print("Searching Bing...\n")
+
+    results = bing_search(query, num_results=args.num)
+
+    if not results:
+        print("No results found.")
+        return
+
+    print(f"Found {len(results)} results, fetching content...\n")
+
+    urls = [r['url'] for r in results]
+    raw_results = fetch_all_pages(urls, verbose=True)
+    raw_results.sort(key=lambda x: x[0])
+
+    texts = []
+    valid_results = []
+
+    print("="*50)
+    print("Fetch summary:")
+    print("="*50)
+
+    for i, (idx, text, is_dynamic) in enumerate(raw_results):
+        if text:
+            texts.append(text)
+            valid_results.append(results[i])
+            status = "[D]" if is_dynamic else "[S]"
+        else:
+            status = "[X]"
+
+        print(f"[{i+1}] {status} {results[i]['title'][:40]} | length: {len(text) if text else 0}")
+
+    if not texts:
+        print("\nNo page content fetched.")
+        return
+
+    print(f"\nRanking... (valid pages: {len(texts)})\n")
+
+    scores = rank_documents(texts, query)
+    ranked = list(zip(valid_results, scores))
+    ranked.sort(key=lambda x: x[1], reverse=True)
+
+    print("="*50)
+    print("Ranked results (higher score = more relevant):")
+    print("="*50)
+
+    for i, (r, score) in enumerate(ranked[:10], 1):
+        print(f"\n{i}. Score: {score:.3f}")
+        print(f" Title: {r['title']}")
+        print(f" URL: {r['url']}")
+
+
+def main():
+    """Main entry function."""
+    parser = argparse.ArgumentParser(
+        description='SkySearch - A lightweight search engine based on Bing + BM25',
+        usage='%(prog)s [-n NUM] [-q QUERY] [QUERY]\n'
+              ' %(prog)s --url <url>'
+    )
+    parser.add_argument('query', nargs='?', help='search query')
+    parser.add_argument('-n', '--num', type=int, default=10, help='number of results (default: 10)')
+    parser.add_argument('-q', '--query', dest='query2', help='search query (long form)')
+    parser.add_argument('--url', help='fetch a single URL directly')
+    parser.add_argument('--mode', choices=['text', 'info', 'raw'], default='text', help='output mode (default: text)')
+    parser.add_argument('--keep', action='store_true', help='keep browser open after fetching')
+
+    args = parser.parse_args()
+
+    if args.url:
+        handle_url(args)
+    else:
+        handle_search(args)
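To see how the parser defined in `main()` interprets different invocations, the arguments can be rebuilt and fed explicit argument lists. The sketch below copies the `add_argument` calls from the diff into a hypothetical `build_parser()` helper (the original builds the parser inline) and leaves out the help texts and the custom usage string.

```python
import argparse


def build_parser():
    """Rebuild the CLI arguments from cli.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="skysearch")
    parser.add_argument("query", nargs="?")
    parser.add_argument("-n", "--num", type=int, default=10)
    # The -q/--query option is stored as `query2` so it does not clash
    # with the positional `query`.
    parser.add_argument("-q", "--query", dest="query2")
    parser.add_argument("--url")
    parser.add_argument("--mode", choices=["text", "info", "raw"], default="text")
    parser.add_argument("--keep", action="store_true")
    return parser


parser = build_parser()

# Search mode: positional query plus options.
search_args = parser.parse_args(["-n", "20", "Python tutorial", "--keep"])
print(search_args.query, search_args.num, search_args.keep)

# URL mode: no query is given, so `query` stays None.
url_args = parser.parse_args(["--url", "https://example.com", "--mode", "info"])
print(url_args.url, url_args.mode, url_args.query)
```

Because `handle_search()` evaluates `args.query or args.query2`, the positional query takes precedence when both forms are supplied.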
skysearch-0.2.0/src/skysearch/fetcher/__init__.py
ADDED
@@ -0,0 +1,17 @@
+"""Page-fetching module."""
+
+from .core import fetch_page_text, is_dynamic_page, extract_text
+from .session import fetch_page_session, fetch_all_pages
+from .parser import parse_html, parse_info, parse_raw, parse_text
+
+__all__ = [
+    'fetch_page_text',
+    'is_dynamic_page',
+    'extract_text',
+    'fetch_page_session',
+    'fetch_all_pages',
+    'parse_html',
+    'parse_info',
+    'parse_raw',
+    'parse_text',
+]
skysearch-0.2.0/src/skysearch/fetcher/core.py
ADDED
@@ -0,0 +1,77 @@
+"""Core page-fetching module."""
+
+import time
+import os
+from DrissionPage import SessionPage, ChromiumPage, ChromiumOptions
+from bs4 import BeautifulSoup
+from readability.readability import Document
+
+DEBUG = os.getenv('SKYSEARCH_DEBUG', 'false').lower() in ('1', 'true', 'yes')
+
+
+def is_dynamic_page(html: str) -> bool:
+    """Decide whether the page needs browser rendering."""
+    dynamic_markers = ['id="root"', 'id="app"', 'id="__next"', '__NEXT_DATA__']
+
+    for marker in dynamic_markers:
+        if marker in html:
+            return True
+
+    soup = BeautifulSoup(html, 'lxml')
+    text = soup.get_text()
+    if len(text) < 500:
+        return True
+
+    return False
+
+
+def extract_text(html: str) -> str:
+    """Extract the main body text with readability-lxml."""
+    if not html or len(html.strip()) == 0:
+        return ''
+
+    try:
+        doc = Document(html)
+        summary_html = doc.summary()
+        soup = BeautifulSoup(summary_html, 'lxml')
+        text = soup.get_text(separator=' ', strip=True)
+        return text if len(text) >= 50 else ''
+    except Exception:
+        return ''
+
+
+def fetch_page_text(url: str, close_browser: bool = True) -> str:
+    """Fetch a single page and extract its main body text."""
+    browser = None
+    try:
+        page = SessionPage()
+        page.get(url, retry=2, timeout=10)
+        html = page.html
+
+        if is_dynamic_page(html):
+            if DEBUG:
+                print(f" [DEBUG] Dynamic page, using browser | original length: {len(html)}")
+            try:
+                options = ChromiumOptions()
+                options.headless = True
+                browser = ChromiumPage(options)
+                browser.get(url)
+                browser.wait.load_start()
+                time.sleep(2)
+                html = browser.html
+            except Exception as e:
+                if DEBUG:
+                    print(f" [DEBUG] Browser failed, using original: {e}")
+                if browser:
+                    browser.quit()
+                    browser = None
+
+        return extract_text(html)
+
+    except Exception as e:
+        if DEBUG:
+            print(f" [DEBUG] Fetch failed: {e}")
+        return ''
+    finally:
+        if close_browser and browser:
+            browser.quit()
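`is_dynamic_page()` combines two heuristics: known single-page-app mount-point markers, and a text length below 500 characters. The sketch below reproduces that logic using only the standard library's `html.parser` in place of BeautifulSoup, which makes the behavior easy to try without installing the dependencies. `looks_dynamic` and `_TextCollector` are hypothetical names, and the text extraction may differ in detail from BeautifulSoup's `get_text()`.

```python
from html.parser import HTMLParser

# Same markers as in the diff: mount points used by common front-end frameworks.
DYNAMIC_MARKERS = ('id="root"', 'id="app"', 'id="__next"', "__NEXT_DATA__")


class _TextCollector(HTMLParser):
    """Accumulate every text node the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def looks_dynamic(html, min_text_length=500):
    """Return True if the page probably needs a browser to render."""
    # Heuristic 1: a framework mount point appears in the markup.
    if any(marker in html for marker in DYNAMIC_MARKERS):
        return True
    # Heuristic 2: the static HTML carries very little text.
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return len("".join(collector.chunks)) < min_text_length


spa_shell = '<html><body><div id="root"></div></body></html>'
article = "<html><body><p>" + "word " * 200 + "</p></body></html>"
print(looks_dynamic(spa_shell), looks_dynamic(article))
```

Under the second heuristic a short but fully static page would also be classified as dynamic; that costs an extra browser launch without changing the extracted content.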
skysearch-0.2.0/src/skysearch/fetcher/parser.py
ADDED
@@ -0,0 +1,64 @@
+"""HTML parsing module."""
+
+from bs4 import BeautifulSoup
+from readability.readability import Document
+
+
+def parse_info(html):
+    """Parse structured info: url, title, text, links, meta."""
+    soup = BeautifulSoup(html, 'lxml')
+
+    # Extract the title
+    title = soup.title.string if soup.title else ""
+
+    # Extract meta info
+    meta = {}
+    for tag in soup.find_all('meta'):
+        name = tag.get('property') or tag.get('name')
+        content = tag.get('content', '')
+        if name and content:
+            meta[name] = content
+
+    # Extract links
+    links = []
+    for a in soup.find_all('a', href=True):
+        links.append({
+            "text": a.get_text(strip=True),
+            "href": a['href']
+        })
+
+    # Extract the main body
+    doc = Document(html)
+    summary_html = doc.summary()
+    text_soup = BeautifulSoup(summary_html, 'lxml')
+    text = text_soup.get_text(separator=' ', strip=True)
+
+    return {
+        "title": title,
+        "text": text,
+        "links": links,
+        "meta": meta
+    }
+
+
+def parse_raw(html):
+    """Return the raw HTML."""
+    return {"html": html}
+
+
+def parse_text(html):
+    """Extract plain text."""
+    doc = Document(html)
+    summary_html = doc.summary()
+    soup = BeautifulSoup(summary_html, 'lxml')
+    return soup.get_text(separator=' ', strip=True)
+
+
+def parse_html(html, mode='text'):
+    """Parse HTML according to the mode."""
+    if mode == 'info':
+        return parse_info(html)
+    elif mode == 'raw':
+        return parse_raw(html)
+    else:
+        return parse_text(html)
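`parse_info()` collects `<meta>` tags keyed by `property` or `name`, plus every `<a href>` together with its link text. A minimal standard-library sketch of the same extraction follows. It swaps BeautifulSoup for `html.parser`, `InfoCollector` is a hypothetical name, and it leaves out the title and the readability-based body extraction.

```python
from html.parser import HTMLParser


class InfoCollector(HTMLParser):
    """Collect meta tags and links in a single pass over the HTML."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []
        self._open_link = None  # link dict currently receiving text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Prefer Open Graph style "property", then fall back to "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content") or ""
            if key and content:
                self.meta[key] = content
        elif tag == "a" and attrs.get("href") is not None:
            self._open_link = {"text": "", "href": attrs["href"]}
            self.links.append(self._open_link)

    def handle_data(self, data):
        # Text only counts while an <a href=...> element is open.
        if self._open_link is not None:
            self._open_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_link is not None:
            self._open_link["text"] = self._open_link["text"].strip()
            self._open_link = None


html = (
    '<html><head><meta name="description" content="Demo page">'
    '<meta property="og:title" content="Demo"></head>'
    '<body><a href="/docs"> Docs </a><a>no href</a></body></html>'
)
collector = InfoCollector()
collector.feed(html)
print(collector.meta)
print(collector.links)
```

As in the original, anchors without an `href` are skipped and meta tags with empty content are dropped. The `href` values are returned as written in the page, so relative links stay relative.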
@@ -0,0 +1,95 @@
|
|
|
1
|
+
"""页面抓取会话模块。"""
|
|
2
|
+
|
|
3
|
+
import time
|
|
4
|
+
from concurrent.futures import ThreadPoolExecutor, as_completed
|
|
5
|
+
from DrissionPage import SessionPage, ChromiumPage, ChromiumOptions
|
|
6
|
+
from .core import extract_text, is_dynamic_page
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
def fetch_page_session(url: str):
|
|
10
|
+
"""Fetch a page with SessionPage."""
|
|
11
|
+
try:
|
|
12
|
+
page = SessionPage()
|
|
13
|
+
page.get(url, retry=2, timeout=10)
|
|
14
|
+
html = page.html
|
|
15
|
+
text = extract_text(html)
|
|
16
|
+
return text, is_dynamic_page(html)
|
|
17
|
+
except Exception:
|
|
18
|
+
return '', True
|
|
19
|
+
|
|
20
|
+
|
|
21
|
+
def fetch_all_pages(urls, verbose=False):
|
|
22
|
+
"""Two-round fetch: parallel SessionPage requests, then multi-tab Chromium retries.
|
|
23
|
+
|
|
24
|
+
Args:
|
|
25
|
+
urls: List of URLs.
|
|
26
|
+
verbose: Whether to print detailed progress (default False).
|
|
27
|
+
"""
|
|
28
|
+
if verbose:
|
|
29
|
+
print("[1/2] Fetching pages with SessionPage...")
|
|
30
|
+
results = [None] * len(urls)
|
|
31
|
+
failed_indices = []
|
|
32
|
+
|
|
33
|
+
with ThreadPoolExecutor(max_workers=5) as executor:
|
|
34
|
+
futures = {executor.submit(fetch_page_session, url): i for i, url in enumerate(urls)}
|
|
35
|
+
|
|
36
|
+
for future in as_completed(futures):
|
|
37
|
+
i = futures[future]
|
|
38
|
+
text, maybe_dynamic = future.result()
|
|
39
|
+
results[i] = (i, text, maybe_dynamic)
|
|
40
|
+
|
|
41
|
+
if verbose:
|
|
42
|
+
if text:
|
|
43
|
+
print(f" OK [{i+1}/{len(urls)}] length: {len(text)}")
|
|
44
|
+
else:
|
|
45
|
+
failed_indices.append(i)
|
|
46
|
+
print(f" RETRY [{i+1}/{len(urls)}] will retry with browser")
|
|
47
|
+
elif not text:
|
|
48
|
+
failed_indices.append(i)
|
|
49
|
+
|
|
50
|
+
# Round two: retry with Chromium in multiple tabs
|
|
51
|
+
if failed_indices:
|
|
52
|
+
if verbose:
|
|
53
|
+
print(f"\n[2/2] Retrying {len(failed_indices)} pages with Chromium (multi-tab)...")
|
|
54
|
+
|
|
55
|
+
retry_urls = [urls[i] for i in failed_indices]
|
|
56
|
+
|
|
57
|
+
try:
|
|
58
|
+
options = ChromiumOptions()
|
|
59
|
+
options.headless(True)  # call the setter; assumes DrissionPage 4.x, where headless is a method
|
|
60
|
+
browser = ChromiumPage(options)
|
|
61
|
+
|
|
62
|
+
try:
|
|
63
|
+
browser.get(retry_urls[0])
|
|
64
|
+
browser.wait.load_start()
|
|
65
|
+
time.sleep(2)
|
|
66
|
+
|
|
67
|
+
for url in retry_urls[1:]:
|
|
68
|
+
browser.new_tab(url)
|
|
69
|
+
|
|
70
|
+
time.sleep(3)
|
|
71
|
+
|
|
72
|
+
tab_ids = browser.tab_ids
|
|
73
|
+
|
|
74
|
+
for tab_idx, tab_id in enumerate(tab_ids):
|
|
75
|
+
if tab_idx < len(retry_urls):
|
|
76
|
+
try:
|
|
77
|
+
browser.activate_tab(tab_id)
|
|
78
|
+
time.sleep(1)
|
|
79
|
+
html = browser.html
|
|
80
|
+
text = extract_text(html)
|
|
81
|
+
original_idx = failed_indices[tab_idx]
|
|
82
|
+
results[original_idx] = (original_idx, text, True)
|
|
83
|
+
if verbose:
|
|
84
|
+
print(f" OK [retry {tab_idx+1}/{len(retry_urls)}] length: {len(text)}")
|
|
85
|
+
except Exception as e:
|
|
86
|
+
if verbose:
|
|
87
|
+
print(f" FAIL [retry {tab_idx+1}/{len(retry_urls)}] error: {e}")
|
|
88
|
+
finally:
|
|
89
|
+
browser.quit()
|
|
90
|
+
|
|
91
|
+
except Exception as e:
|
|
92
|
+
if verbose:
|
|
93
|
+
print(f"Chromium retry failed: {e}")
|
|
94
|
+
|
|
95
|
+
return results
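
The two-round pattern in `fetch_all_pages` (a cheap parallel pass, then a slower retry for whatever came back empty) can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `two_round_fetch`, `fast_fetch`, and `slow_fetch` are hypothetical names, and the two callables stand in for SessionPage and Chromium. One deliberate difference is that the retry pass here is keyed by original index rather than by tab position.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def two_round_fetch(urls, fast_fetch, slow_fetch, max_workers=5):
    """Fetch every URL with fast_fetch in parallel, then retry empties with slow_fetch.

    fast_fetch and slow_fetch are hypothetical callables that take a URL and
    return its text ('' on failure).
    """
    results = [""] * len(urls)
    failed = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its index so output order matches input order.
        futures = {executor.submit(fast_fetch, url): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception:
                results[i] = ""
            if not results[i]:
                failed.append(i)

    # Round two: retry sequentially, keyed by original index, so a result
    # can never land in the wrong slot regardless of completion order.
    for i in sorted(failed):
        try:
            results[i] = slow_fetch(urls[i])
        except Exception:
            results[i] = ""

    return results
```

Because `as_completed` yields futures in completion order, the `futures` dictionary is what keeps each result attached to the right input position.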
|
|
@@ -0,0 +1,29 @@
|
|
|
1
|
+
"""BM25 ranking module."""
|
|
2
|
+
|
|
3
|
+
import jieba
|
|
4
|
+
from rank_bm25 import BM25Okapi
|
|
5
|
+
from typing import List
|
|
6
|
+
|
|
7
|
+
|
|
8
|
+
def rank_documents(documents: List[str], query: str) -> List[float]:
|
|
9
|
+
"""
|
|
10
|
+
Score documents for relevance to a query using the BM25 algorithm.
|
|
11
|
+
|
|
12
|
+
Args:
|
|
13
|
+
documents: List of document texts.
|
|
14
|
+
query: Search keywords.
|
|
15
|
+
|
|
16
|
+
Returns:
|
|
17
|
+
A list of scores in the same order as the input documents.
|
|
18
|
+
"""
|
|
19
|
+
if not documents:
|
|
20
|
+
return []
|
|
21
|
+
|
|
22
|
+
# Tokenize with jieba (Chinese word segmentation)
|
|
23
|
+
tokenized_corpus = [list(jieba.cut(doc)) for doc in documents]
|
|
24
|
+
bm25 = BM25Okapi(tokenized_corpus)
|
|
25
|
+
|
|
26
|
+
tokenized_query = list(jieba.cut(query))
|
|
27
|
+
scores = bm25.get_scores(tokenized_query)
|
|
28
|
+
|
|
29
|
+
return scores.tolist()
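
To make the scoring step in `rank_documents` concrete, here is a small pure-Python sketch of Okapi BM25 over pre-tokenized documents. It is an illustration under assumptions, not a reimplementation of rank-bm25: the function name `bm25_scores` is hypothetical, the `k1` and `b` defaults are common textbook values, and the smoothed IDF used here may differ from the variant the library applies.

```python
import math
from collections import Counter


def bm25_scores(tokenized_docs, tokenized_query, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []

    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    term_freqs = [Counter(doc) for doc in tokenized_docs]

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for freqs in term_freqs:
        doc_freq.update(freqs.keys())

    scores = []
    for doc, freqs in zip(tokenized_docs, term_freqs):
        # Length normalization; guard against an all-empty corpus.
        norm = 1 - b + b * (len(doc) / avg_len if avg_len else 0.0)
        score = 0.0
        for term in tokenized_query:
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative); an assumption
            # of this sketch, not necessarily what rank-bm25 does.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(score)
    return scores
```

Scores come back in input order, matching the contract described in the docstring above; sorting by score is left to the caller.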
|
|
@@ -0,0 +1,67 @@
|
|
|
1
|
+
"""Bing search module."""
|
|
2
|
+
|
|
3
|
+
from typing import List, Dict
|
|
4
|
+
from DrissionPage import SessionPage
|
|
5
|
+
from bs4 import BeautifulSoup
|
|
6
|
+
from urllib.parse import quote, urlparse, parse_qs
|
|
7
|
+
|
|
8
|
+
|
|
9
|
+
def bing_search(query: str, num_results: int = 10) -> List[Dict[str, str]]:
|
|
10
|
+
"""
|
|
11
|
+
Fetch search results from Bing.
|
|
12
|
+
|
|
13
|
+
Args:
|
|
14
|
+
query: Search keywords.
|
|
15
|
+
num_results: Maximum number of results to return.
|
|
16
|
+
|
|
17
|
+
Returns:
|
|
18
|
+
A list of dicts, each containing title, url, and snippet.
|
|
19
|
+
"""
|
|
20
|
+
url = f"https://www.bing.com/search?q={quote(query)}"
|
|
21
|
+
page = SessionPage()
|
|
22
|
+
page.get(url, retry=2)
|
|
23
|
+
|
|
24
|
+
soup = BeautifulSoup(page.html, 'lxml')
|
|
25
|
+
results = []
|
|
26
|
+
|
|
27
|
+
# Bing results live in <li class="b_algo"> elements
|
|
28
|
+
for li in soup.select('li.b_algo')[:num_results]:
|
|
29
|
+
a_tag = li.select_one('h2 a')
|
|
30
|
+
p_tag = li.select_one('p')
|
|
31
|
+
|
|
32
|
+
if not a_tag:
|
|
33
|
+
continue
|
|
34
|
+
|
|
35
|
+
title = a_tag.text.strip()
|
|
36
|
+
href = a_tag.get('href', '')
|
|
37
|
+
url = parse_real_url(href)
|
|
38
|
+
snippet = p_tag.text.strip() if p_tag else ''
|
|
39
|
+
|
|
40
|
+
results.append({
|
|
41
|
+
'title': title,
|
|
42
|
+
'url': url,
|
|
43
|
+
'snippet': snippet,
|
|
44
|
+
})
|
|
45
|
+
|
|
46
|
+
return results
|
|
47
|
+
|
|
48
|
+
|
|
49
|
+
def parse_real_url(redirect_url: str) -> str:
|
|
50
|
+
"""
|
|
51
|
+
Extract the real URL from a Bing redirect URL.
|
|
52
|
+
|
|
53
|
+
Args:
|
|
54
|
+
redirect_url: Bing redirect URL, in a form such as /url?q=<actual URL>&...
|
|
55
|
+
|
|
56
|
+
Returns:
|
|
57
|
+
The real URL as a string.
|
|
58
|
+
"""
|
|
59
|
+
if not redirect_url:
|
|
60
|
+
return ''
|
|
61
|
+
|
|
62
|
+
if redirect_url.startswith('/url?'):
|
|
63
|
+
parsed = urlparse(redirect_url)
|
|
64
|
+
qs = parse_qs(parsed.query)
|
|
65
|
+
return qs.get('q', [''])[0]
|
|
66
|
+
|
|
67
|
+
return redirect_url
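
The query-string unwrapping done by `parse_real_url` relies on `urllib.parse`, and a standalone sketch shows what `parse_qs` actually returns. The helper name `extract_target` and its `param` argument are hypothetical; unlike the original, this version matches on the parsed path rather than a string prefix, and it falls back to the input when the parameter is missing instead of returning an empty string.

```python
from urllib.parse import parse_qs, urlparse


def extract_target(href, param="q"):
    """Return the URL carried in a redirect link's query string, else href itself.

    `param` is a hypothetical knob for this sketch; the original hard-codes "q".
    """
    if not href:
        return ""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs percent-decodes values and returns a list per key.
        values = parse_qs(parsed.query).get(param)
        if values:
            return values[0]
    return href
```

Since `parse_qs` already percent-decodes, no extra `unquote` call is needed on the extracted value.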
|
|
@@ -0,0 +1,162 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: skysearch
|
|
3
|
+
Version: 0.2.0
|
|
4
|
+
Summary: Search the web, rank results, fetch any page content.
|
|
5
|
+
Author: zimvir
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/zimvir/skysearch
|
|
8
|
+
Project-URL: Documentation, https://github.com/zimvir/skysearch#readme
|
|
9
|
+
Keywords: search,fetch,bing,bm25,web-scraper,crawler
|
|
10
|
+
Classifier: Development Status :: 3 - Alpha
|
|
11
|
+
Classifier: Intended Audience :: Developers
|
|
12
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
16
|
+
Requires-Python: >=3.11
|
|
17
|
+
Description-Content-Type: text/markdown
|
|
18
|
+
Requires-Dist: DrissionPage
|
|
19
|
+
Requires-Dist: beautifulsoup4
|
|
20
|
+
Requires-Dist: lxml
|
|
21
|
+
Requires-Dist: readability-lxml
|
|
22
|
+
Requires-Dist: jieba
|
|
23
|
+
Requires-Dist: rank-bm25
|
|
24
|
+
|
|
25
|
+
# SkySearch
|
|
26
|
+
|
|
27
|
+
A lightweight search engine built on Bing search and BM25 ranking, with support for fetching dynamic pages.
|
|
28
|
+
|
|
29
|
+
## Features
|
|
30
|
+
|
|
31
|
+
- **Bing search**: SessionPage disguises its TLS fingerprint at a low level to get past anti-scraping measures
|
|
32
|
+
- **Dynamic pages**: parallel fetching across multiple Chromium tabs
|
|
33
|
+
- **BM25 ranking**: jieba Chinese word segmentation plus rank-bm25 relevance scoring
|
|
34
|
+
- **Multiple output modes**: `text`, `info`, and `raw`
|
|
35
|
+
- **CLI tool**: accepts command-line arguments and also works as a Python library
|
|
36
|
+
|
|
37
|
+
## Installation
|
|
38
|
+
|
|
39
|
+
```bash
|
|
40
|
+
pip install skysearch
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
## Command-line usage
|
|
44
|
+
|
|
45
|
+
### Search mode
|
|
46
|
+
|
|
47
|
+
```bash
|
|
48
|
+
# Interactive input
|
|
49
|
+
skysearch
|
|
50
|
+
|
|
51
|
+
# Specify keywords
|
|
52
|
+
skysearch "deep learning frameworks"
|
|
53
|
+
|
|
54
|
+
# Specify the number of results
|
|
55
|
+
skysearch -n 20 "Python tutorial"
|
|
56
|
+
|
|
57
|
+
# Keep the browser open
|
|
58
|
+
skysearch -n 20 "keywords" --keep
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
### URL fetch mode
|
|
62
|
+
|
|
63
|
+
```bash
|
|
64
|
+
# Default: output plain text
|
|
65
|
+
skysearch --url https://example.com
|
|
66
|
+
|
|
67
|
+
# Specify the output mode
|
|
68
|
+
skysearch --url https://example.com --mode text # plain text (default)
|
|
69
|
+
skysearch --url https://example.com --mode info # structured info
|
|
70
|
+
skysearch --url https://example.com --mode raw # raw HTML
|
|
71
|
+
|
|
72
|
+
# Keep the browser open
|
|
73
|
+
skysearch --url https://example.com --keep
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
## Use as a library
|
|
77
|
+
|
|
78
|
+
```python
|
|
79
|
+
import skysearch
|
|
80
|
+
|
|
81
|
+
# Search
|
|
82
|
+
results = skysearch.search("deep learning", num=10)
|
|
83
|
+
# [{'title': ..., 'url': ..., 'score': 12.5, 'snippet': ...}, ...]
|
|
84
|
+
|
|
85
|
+
# Fetch a URL
|
|
86
|
+
text = skysearch.fetch("https://example.com")
|
|
87
|
+
info = skysearch.fetch("https://example.com", mode='info')
|
|
88
|
+
raw = skysearch.fetch("https://example.com", mode='raw')
|
|
89
|
+
|
|
90
|
+
# Standalone functions
|
|
91
|
+
links = skysearch.fetch_links("https://example.com")
|
|
92
|
+
info_dict = skysearch.fetch_info("https://example.com")
|
|
93
|
+
raw_dict = skysearch.fetch_raw("https://example.com")
|
|
94
|
+
|
|
95
|
+
# Search and fetch in one call
|
|
96
|
+
results = skysearch.search_and_fetch("keywords", mode='info')
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
## API parameters
|
|
100
|
+
|
|
101
|
+
### search(query, num=10, verbose=False, keep=False, tuple_format=False)
|
|
102
|
+
|
|
103
|
+
| Parameter | Description | Default |
|
|
104
|
+
|------|------|--------|
|
|
105
|
+
| query | Search keywords | - |
|
|
106
|
+
| num | Number of results | 10 |
|
|
107
|
+
| verbose | Print detailed progress | False |
|
|
108
|
+
| keep | Keep the browser open | False |
|
|
109
|
+
| tuple_format | Return results as tuples | False |
|
|
110
|
+
|
|
111
|
+
### fetch(url, mode='text', keep=False, timeout=10, retry=2)
|
|
112
|
+
|
|
113
|
+
| Parameter | Description | Default |
|
|
114
|
+
|------|------|--------|
|
|
115
|
+
| url | Page URL | - |
|
|
116
|
+
| mode | Output mode: `text`, `info`, or `raw` | text |
|
|
117
|
+
| keep | Keep the browser open | False |
|
|
118
|
+
| timeout | Request timeout in seconds | 10 |
|
|
119
|
+
| retry | Number of retries | 2 |
|
|
120
|
+
|
|
121
|
+
### search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False)
|
|
122
|
+
|
|
123
|
+
Searches and fetches in one call; returns a list containing each result's info along with the fetched content.
|
|
124
|
+
|
|
125
|
+
## Output modes
|
|
126
|
+
|
|
127
|
+
| Mode | Description | Use case |
|
|
128
|
+
|------|------|----------|
|
|
129
|
+
| `text` | Plain-text body | Human reading |
|
|
130
|
+
| `info` | Structured JSON (url, title, text, links, meta) | Data analysis / agents |
|
|
131
|
+
| `raw` | Raw HTML | In-depth parsing |
|
|
132
|
+
|
|
133
|
+
## Tech stack
|
|
134
|
+
|
|
135
|
+
| Component | Technology |
|
|
136
|
+
|------|------|
|
|
137
|
+
| HTTP requests | DrissionPage (SessionPage) |
|
|
138
|
+
| Dynamic rendering | DrissionPage (ChromiumPage) |
|
|
139
|
+
| HTML parsing | BeautifulSoup4 + lxml |
|
|
140
|
+
| Body-text extraction | readability-lxml |
|
|
141
|
+
| Chinese word segmentation | jieba |
|
|
142
|
+
| Ranking algorithm | rank-bm25 (BM25Okapi) |
|
|
143
|
+
|
|
144
|
+
## Project structure
|
|
145
|
+
|
|
146
|
+
```
|
|
147
|
+
src/skysearch/
|
|
148
|
+
├── __init__.py # Library entry point; exports the full API
|
|
149
|
+
├── cli.py # Command-line entry point
|
|
150
|
+
├── search.py # Bing search
|
|
151
|
+
├── ranker.py # BM25 ranking
|
|
152
|
+
├── api.py # Simple API interface
|
|
153
|
+
└── fetcher/ # Page-fetching package
|
|
154
|
+
├── __init__.py
|
|
155
|
+
├── core.py # Core functions
|
|
156
|
+
├── session.py # Session management
|
|
157
|
+
└── parser.py # HTML parsing
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
## License
|
|
161
|
+
|
|
162
|
+
MIT
|
|
@@ -0,0 +1,17 @@
|
|
|
1
|
+
README.md
|
|
2
|
+
pyproject.toml
|
|
3
|
+
src/skysearch/__init__.py
|
|
4
|
+
src/skysearch/api.py
|
|
5
|
+
src/skysearch/cli.py
|
|
6
|
+
src/skysearch/ranker.py
|
|
7
|
+
src/skysearch/search.py
|
|
8
|
+
src/skysearch.egg-info/PKG-INFO
|
|
9
|
+
src/skysearch.egg-info/SOURCES.txt
|
|
10
|
+
src/skysearch.egg-info/dependency_links.txt
|
|
11
|
+
src/skysearch.egg-info/entry_points.txt
|
|
12
|
+
src/skysearch.egg-info/requires.txt
|
|
13
|
+
src/skysearch.egg-info/top_level.txt
|
|
14
|
+
src/skysearch/fetcher/__init__.py
|
|
15
|
+
src/skysearch/fetcher/core.py
|
|
16
|
+
src/skysearch/fetcher/parser.py
|
|
17
|
+
src/skysearch/fetcher/session.py
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
skysearch
|