skysearch 0.2.0__tar.gz

@@ -0,0 +1,162 @@
+ Metadata-Version: 2.4
+ Name: skysearch
+ Version: 0.2.0
+ Summary: Search the web, rank results, fetch any page content.
+ Author: zimvir
+ License: MIT
+ Project-URL: Homepage, https://github.com/zimvir/skysearch
+ Project-URL: Documentation, https://github.com/zimvir/skysearch#readme
+ Keywords: search,fetch,bing,bm25,web-scraper,crawler
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Python: >=3.11
+ Description-Content-Type: text/markdown
+ Requires-Dist: DrissionPage
+ Requires-Dist: beautifulsoup4
+ Requires-Dist: lxml
+ Requires-Dist: readability-lxml
+ Requires-Dist: jieba
+ Requires-Dist: rank-bm25
+
+ # SkySearch
+
+ A lightweight search engine built on Bing search and BM25 ranking, with support for fetching dynamic pages.
+
+ ## Features
+
+ - **Bing search**: SessionPage with low-level TLS fingerprint spoofing to bypass anti-scraping measures
+ - **Dynamic pages**: parallel fetching across multiple Chromium tabs
+ - **BM25 ranking**: jieba Chinese word segmentation plus rank-bm25 relevance ranking
+ - **Multiple output modes**: three output modes, `text` / `info` / `raw`
+ - **CLI tool**: accepts command-line arguments and can also be used as a Python library
+
+ ## Installation
+
+ ```bash
+ pip install skysearch
+ ```
+
+ ## Command-line usage
+
+ ### Search mode
+
+ ```bash
+ # Interactive prompt
+ skysearch
+
+ # Pass a query (this one means "deep learning frameworks")
+ skysearch "深度学习框架"
+
+ # Set the number of results (the query means "Python tutorial")
+ skysearch -n 20 "Python教程"
+
+ # Keep the browser open
+ skysearch -n 20 "keyword" --keep
+ ```
+
+ ### URL fetch mode
+
+ ```bash
+ # Default: print plain text
+ skysearch --url https://example.com
+
+ # Choose an output mode
+ skysearch --url https://example.com --mode text   # plain text (default)
+ skysearch --url https://example.com --mode info   # structured info
+ skysearch --url https://example.com --mode raw    # raw HTML
+
+ # Keep the browser open
+ skysearch --url https://example.com --keep
+ ```
+
+ ## Use as a library
+
+ ```python
+ import skysearch
+
+ # Search (the query means "deep learning")
+ results = skysearch.search("深度学习", num=10)
+ # [{'title': ..., 'url': ..., 'score': 12.5, 'snippet': ...}, ...]
+
+ # Fetch a URL
+ text = skysearch.fetch("https://example.com")
+ info = skysearch.fetch("https://example.com", mode='info')
+ raw = skysearch.fetch("https://example.com", mode='raw')
+
+ # Individual functions
+ links = skysearch.fetch_links("https://example.com")
+ info_dict = skysearch.fetch_info("https://example.com")
+ raw_dict = skysearch.fetch_raw("https://example.com")
+
+ # Search and fetch in one call
+ results = skysearch.search_and_fetch("keyword", mode='info')
+ ```
+
+ ## API parameters
+
+ ### search(query, num=10, verbose=False, keep=False, tuple_format=False)
+
+ | Parameter | Description | Default |
+ |------|------|--------|
+ | query | Search query | - |
+ | num | Number of results | 10 |
+ | verbose | Print progress details | False |
+ | keep | Keep the browser open | False |
+ | tuple_format | Return results as tuples | False |
+
+ ### fetch(url, mode='text', keep=False, timeout=10, retry=2)
+
+ | Parameter | Description | Default |
+ |------|------|--------|
+ | url | Page URL | - |
+ | mode | Output mode: `text`, `info`, or `raw` | text |
+ | keep | Keep the browser open | False |
+ | timeout | Request timeout in seconds | 10 |
+ | retry | Number of retries | 2 |
+
+ ### search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False)
+
+ Searches and fetches in one call. Returns a list whose items contain the result info and the fetched content.
+
+ ## Output modes
+
+ | Mode | Description | Use case |
+ |------|------|----------|
+ | `text` | Plain-text body | Human reading |
+ | `info` | Structured JSON (url, title, text, links, meta) | Data analysis / agents |
+ | `raw` | Raw HTML | In-depth parsing |
+
+ ## Tech stack
+
+ | Component | Technology |
+ |------|------|
+ | HTTP requests | DrissionPage (SessionPage) |
+ | Dynamic rendering | DrissionPage (ChromiumPage) |
+ | HTML parsing | BeautifulSoup4 + lxml |
+ | Main-content extraction | readability-lxml |
+ | Chinese word segmentation | jieba |
+ | Ranking algorithm | rank-bm25 (BM25Okapi) |
+
+ ## Project structure
+
+ ```
+ src/skysearch/
+ ├── __init__.py          # Library entry point, exports the whole API
+ ├── cli.py               # Command-line entry point
+ ├── search.py            # Bing search
+ ├── ranker.py            # BM25 ranking
+ ├── api.py               # Simple API layer
+ └── fetcher/             # Page-fetching package
+     ├── __init__.py
+     ├── core.py          # Core functions
+     ├── session.py       # Session management
+     └── parser.py        # HTML parsing
+ ```
+
+ ## License
+
+ MIT
@@ -0,0 +1,42 @@
+ [project]
+ name = "skysearch"
+ version = "0.2.0"
+ description = "Search the web, rank results, fetch any page content."
+ readme = "README.md"
+ requires-python = ">=3.11"
+ license = {text = "MIT"}
+ authors = [
+     {name = "zimvir"}
+ ]
+ keywords = ["search", "fetch", "bing", "bm25", "web-scraper", "crawler"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "License :: OSI Approved :: MIT License",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+ ]
+
+ dependencies = [
+     "DrissionPage",
+     "beautifulsoup4",
+     "lxml",
+     "readability-lxml",
+     "jieba",
+     "rank-bm25",
+ ]
+
+ [project.scripts]
+ skysearch = "skysearch.cli:main"
+
+ [project.urls]
+ Homepage = "https://github.com/zimvir/skysearch"
+ Documentation = "https://github.com/zimvir/skysearch#readme"
+
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [tool.setuptools.packages.find]
+ where = ["src"]
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
@@ -0,0 +1,40 @@
+ """SkySearch - a lightweight search engine built on Bing search and BM25 ranking."""
+
+ __version__ = "0.2.0"
+
+ from .search import bing_search
+ from .fetcher import (
+     fetch_page_text,
+     fetch_all_pages,
+     is_dynamic_page,
+     parse_html,
+     parse_info,
+ )
+ from .ranker import rank_documents
+ from . import api
+
+ __all__ = [
+     'bing_search',
+     'fetch_page_text',
+     'fetch_all_pages',
+     'is_dynamic_page',
+     'parse_html',
+     'parse_info',
+     'rank_documents',
+     'api',
+     # API functions
+     'search',
+     'fetch',
+     'fetch_info',
+     'fetch_raw',
+     'fetch_links',
+     'search_and_fetch',
+ ]
+
+ # Shortcuts to the API functions
+ search = api.search
+ fetch = api.fetch
+ fetch_info = api.fetch_info
+ fetch_raw = api.fetch_raw
+ fetch_links = api.fetch_links
+ search_and_fetch = api.search_and_fetch
@@ -0,0 +1,279 @@
+ """Simple API layer that mirrors the CLI behavior."""
+
+ import json
+ from .search import bing_search as _bing_search
+ from .fetcher import (
+     fetch_page_text as _fetch_page_text,
+     fetch_all_pages as _fetch_all_pages,
+     parse_html as _parse_html,
+     is_dynamic_page as _is_dynamic_page,
+     parse_info as _parse_info,
+ )
+ from .ranker import rank_documents as _rank_documents
+ from DrissionPage import SessionPage, ChromiumOptions, ChromiumPage
+ from bs4 import BeautifulSoup
+ import time
+
+
+ def search(query, num=10, verbose=False, keep=False, tuple_format=False):
+     """
+     Search mode (like the CLI: skysearch "keyword").
+
+     Args:
+         query: search query
+         num: number of results (default 10)
+         verbose: whether to print progress details (default False)
+         keep: whether to keep the browser open (default False)
+         tuple_format: whether to return tuples (default False)
+             False: [{"title": ..., "url": ..., "score": 12.5}, ...]
+             True: [({"title": ..., "url": ...}, 12.5), ...]
+
+     Returns:
+         The ranked list of results
+     """
+     if verbose:
+         print(f"Searching Bing for: {query}\n")
+
+     results = _bing_search(query, num_results=num)
+     if not results:
+         if verbose:
+             print("No results found.")
+         return []
+
+     if verbose:
+         print(f"Found {len(results)} results, fetching content...\n")
+
+     urls = [r['url'] for r in results]
+     raw_results = _fetch_all_pages(urls, verbose=verbose)
+     raw_results.sort(key=lambda x: x[0])
+
+     texts = []
+     valid_results = []
+
+     for i, (idx, text, is_dynamic) in enumerate(raw_results):
+         if text:
+             texts.append(text)
+             valid_results.append(results[i])
+             if verbose:
+                 status = "[D]" if is_dynamic else "[S]"
+                 print(f"[{i+1}] {status} {results[i]['title'][:40]} | length: {len(text)}")
+
+     if not texts:
+         if verbose:
+             print("\nNo page content fetched.")
+         return []
+
+     if verbose:
+         print(f"\nRanking... (valid pages: {len(texts)})\n")
+
+     scores = _rank_documents(texts, query)
+
+     if tuple_format:
+         ranked = list(zip(valid_results, scores))
+         ranked.sort(key=lambda x: x[1], reverse=True)
+         return ranked
+
+     # Default: return a flat list of dicts
+     output = []
+     for r, score in zip(valid_results, scores):
+         item = dict(r)
+         item['score'] = score
+         output.append(item)
+
+     output.sort(key=lambda x: x['score'], reverse=True)
+     return output
+
+
+ def fetch(url, mode='text', keep=False, timeout=10, retry=2):
+     """
+     URL fetch mode (like the CLI: skysearch --url xxx).
+
+     Args:
+         url: page URL
+         mode: output mode, 'text' | 'info' | 'raw' (default 'text')
+         keep: whether to keep the browser open (default False)
+         timeout: request timeout in seconds (default 10)
+         retry: number of retries (default 2)
+
+     Returns:
+         Content whose shape depends on the mode
+     """
+     page = SessionPage()
+     page.get(url, retry=retry, timeout=timeout)
+     html = page.html
+
+     # Render dynamic pages in a browser
+     if _is_dynamic_page(html):
+         options = ChromiumOptions()
+         options.headless = True
+         browser = ChromiumPage(options)
+         try:
+             browser.get(url)
+             browser.wait.load_start()
+             time.sleep(2)
+             html = browser.html
+         finally:
+             if not keep:
+                 browser.quit()
+
+     if mode == 'raw':
+         return json.dumps({"url": url, "html": html}, ensure_ascii=False)
+
+     if mode == 'info':
+         result = {"url": url}
+         result.update(_parse_html(html, mode='info'))
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     if mode == 'text':
+         return _fetch_page_text(url, close_browser=not keep)
+
+     return None
+
+
+ def fetch_info(url, keep=False, timeout=10, retry=2):
+     """
+     Fetch structured information (url, title, text, links, meta).
+
+     Args:
+         url: page URL
+         keep: whether to keep the browser open (default False)
+         timeout: request timeout in seconds (default 10)
+         retry: number of retries (default 2)
+
+     Returns:
+         dict: {"url": ..., "title": ..., "text": ..., "links": [...], "meta": {...}}
+     """
+     page = SessionPage()
+     page.get(url, retry=retry, timeout=timeout)
+     html = page.html
+
+     if _is_dynamic_page(html):
+         options = ChromiumOptions()
+         options.headless = True
+         browser = ChromiumPage(options)
+         try:
+             browser.get(url)
+             browser.wait.load_start()
+             time.sleep(2)
+             html = browser.html
+         finally:
+             if not keep:
+                 browser.quit()
+
+     return {"url": url, **_parse_info(html)}
+
+
+ def fetch_raw(url, keep=False, timeout=10, retry=2):
+     """
+     Fetch the raw HTML.
+
+     Args:
+         url: page URL
+         keep: whether to keep the browser open (default False)
+         timeout: request timeout in seconds (default 10)
+         retry: number of retries (default 2)
+
+     Returns:
+         dict: {"url": ..., "html": ...}
+     """
+     page = SessionPage()
+     page.get(url, retry=retry, timeout=timeout)
+     html = page.html
+
+     if _is_dynamic_page(html):
+         options = ChromiumOptions()
+         options.headless = True
+         browser = ChromiumPage(options)
+         try:
+             browser.get(url)
+             browser.wait.load_start()
+             time.sleep(2)
+             html = browser.html
+         finally:
+             if not keep:
+                 browser.quit()
+
+     return {"url": url, "html": html}
+
+
+ def fetch_links(url, timeout=10, retry=2):
+     """
+     Extract every link on the page.
+
+     Args:
+         url: page URL
+         timeout: request timeout in seconds (default 10)
+         retry: number of retries (default 2)
+
+     Returns:
+         list: [{"text": "link text", "href": "https://..."}, ...]
+     """
+     page = SessionPage()
+     page.get(url, retry=retry, timeout=timeout)
+     html = page.html
+
+     if _is_dynamic_page(html):
+         options = ChromiumOptions()
+         options.headless = True
+         browser = ChromiumPage(options)
+         try:
+             browser.get(url)
+             browser.wait.load_start()
+             time.sleep(2)
+             html = browser.html
+         finally:
+             browser.quit()
+
+     soup = BeautifulSoup(html, 'lxml')
+     links = []
+     for a in soup.find_all('a', href=True):
+         links.append({
+             "text": a.get_text(strip=True),
+             "href": a['href']
+         })
+
+     return links
+
+
+ def search_and_fetch(query, num=10, mode='text', verbose=False, keep=False, tuple_format=False):
+     """
+     Search and fetch in one call.
+
+     Args:
+         query: search query
+         num: number of results (default 10)
+         mode: output mode, 'text' | 'info' | 'raw' (default 'text')
+         verbose: whether to print progress details (default False)
+         keep: whether to keep the browser open (default False)
+         tuple_format: whether to return tuples (default False)
+
+     Returns:
+         list: search results, each holding the result info and the fetched content
+     """
+     results = search(query, num=num, verbose=verbose, keep=keep, tuple_format=tuple_format)
+
+     output = []
+     for item in results:
+         if tuple_format:
+             url = item[0]['url']
+             score = item[1]
+             result_dict = item[0]
+         else:
+             url = item['url']
+             score = item.get('score', 0)
+             result_dict = item
+
+         if mode == 'raw':
+             content = fetch_raw(url, keep=keep)
+         elif mode == 'info':
+             content = fetch_info(url, keep=keep)
+         else:
+             content = fetch(url, mode='text', keep=keep)
+
+         output.append({
+             **result_dict,
+             "score": score,
+             "content": content
+         })
+
+     return output
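A note on the two return shapes of `search`: the flat form copies each result dict and adds a `score` key, while `tuple_format=True` pairs the untouched dict with its score. The sketch below isolates just that step so the difference is easy to see. The helper name `attach_scores` is hypothetical and is not part of the package.

```python
def attach_scores(results, scores, tuple_format=False):
    """Pair results with scores and sort by score, highest first."""
    if tuple_format:
        # Tuple form: the original dicts are left untouched.
        return sorted(zip(results, scores), key=lambda pair: pair[1], reverse=True)
    # Flat form: copy each dict so the caller's input is not mutated.
    merged = [{**result, 'score': score} for result, score in zip(results, scores)]
    return sorted(merged, key=lambda item: item['score'], reverse=True)


results = [{'title': 'A', 'url': 'u1'}, {'title': 'B', 'url': 'u2'}]
scores = [0.5, 2.0]
flat = attach_scores(results, scores)
pairs = attach_scores(results, scores, tuple_format=True)
```

In both shapes the list is sorted in descending score order, which matches what the function above does after ranking.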
@@ -0,0 +1,129 @@
+ """Command-line entry point."""
+
+ import argparse
+ import time
+ import json
+ from .search import bing_search
+ from .fetcher import fetch_all_pages, fetch_page_text, is_dynamic_page, parse_html
+ from .ranker import rank_documents
+ from DrissionPage import SessionPage, ChromiumOptions, ChromiumPage
+
+
+ def handle_url(args):
+     """Handle --url mode."""
+     print(f"Fetching: {args.url}\n")
+
+     page = SessionPage()
+     page.get(args.url, retry=2, timeout=10)
+     html = page.html
+
+     # Render dynamic pages in a browser
+     if is_dynamic_page(html):
+         options = ChromiumOptions()
+         options.headless = True
+         browser = ChromiumPage(options)
+         try:
+             browser.get(args.url)
+             browser.wait.load_start()
+             time.sleep(2)
+             html = browser.html
+         finally:
+             if not args.keep:
+                 browser.quit()
+
+     # Parse and print according to the selected mode
+     if args.mode == 'raw':
+         result = {"url": args.url, "html": html}
+         print(json.dumps(result, ensure_ascii=False))
+     elif args.mode == 'info':
+         result = {"url": args.url}
+         result.update(parse_html(html, mode='info'))
+         print(json.dumps(result, ensure_ascii=False, indent=2))
+     else:
+         text = fetch_page_text(args.url, close_browser=not args.keep)
+         if args.keep:
+             print("Browser kept open (use Ctrl+C to exit)")
+         elif text:
+             print(f"Content ({len(text)} chars):\n")
+             print(text)
+         else:
+             print("Failed to fetch page content.")
+
+
+ def handle_search(args):
+     """Handle search mode."""
+     query = args.query or args.query2
+     if not query:
+         query = input("Enter search query: ")
+
+     print("Searching Bing...\n")
+
+     results = bing_search(query, num_results=args.num)
+
+     if not results:
+         print("No results found.")
+         return
+
+     print(f"Found {len(results)} results, fetching content...\n")
+
+     urls = [r['url'] for r in results]
+     raw_results = fetch_all_pages(urls, verbose=True)
+     raw_results.sort(key=lambda x: x[0])
+
+     texts = []
+     valid_results = []
+
+     print("="*50)
+     print("Fetch summary:")
+     print("="*50)
+
+     for i, (idx, text, is_dynamic) in enumerate(raw_results):
+         if text:
+             texts.append(text)
+             valid_results.append(results[i])
+             status = "[D]" if is_dynamic else "[S]"
+         else:
+             status = "[X]"
+
+         print(f"[{i+1}] {status} {results[i]['title'][:40]} | length: {len(text) if text else 0}")
+
+     if not texts:
+         print("\nNo page content fetched.")
+         return
+
+     print(f"\nRanking... (valid pages: {len(texts)})\n")
+
+     scores = rank_documents(texts, query)
+     ranked = list(zip(valid_results, scores))
+     ranked.sort(key=lambda x: x[1], reverse=True)
+
+     print("="*50)
+     print("Ranked results (higher score = more relevant):")
+     print("="*50)
+
+     for i, (r, score) in enumerate(ranked[:10], 1):
+         print(f"\n{i}. Score: {score:.3f}")
+         print(f" Title: {r['title']}")
+         print(f" URL: {r['url']}")
+
+
+ def main():
+     """Main entry point."""
+     parser = argparse.ArgumentParser(
+         description='SkySearch - A lightweight search engine based on Bing + BM25',
+         usage='%(prog)s [-n NUM] [-q QUERY] [QUERY]\n'
+               ' %(prog)s --url <url>'
+     )
+     parser.add_argument('query', nargs='?', help='search query')
+     parser.add_argument('-n', '--num', type=int, default=10, help='number of results (default: 10)')
+     parser.add_argument('-q', '--query', dest='query2', help='search query (long form)')
+     parser.add_argument('--url', help='fetch a single URL directly')
+     parser.add_argument('--mode', choices=['text', 'info', 'raw'], default='text', help='output mode (default: text)')
+     parser.add_argument('--keep', action='store_true', help='keep browser open after fetching')
+
+     args = parser.parse_args()
+
+     if args.url:
+         handle_url(args)
+     else:
+         handle_search(args)
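A note on the argument layout: the query can arrive either positionally or through `-q/--query`, which is stored under `query2` and merged with `args.query or args.query2`. The sketch below rebuilds only the parser so the flag combinations can be checked without a network or a browser. The factory name `build_parser` is hypothetical; the arguments themselves are copied from `main()` above.

```python
import argparse


def build_parser():
    """Rebuild the CLI parser with the same arguments as main()."""
    parser = argparse.ArgumentParser(prog='skysearch')
    parser.add_argument('query', nargs='?')
    parser.add_argument('-n', '--num', type=int, default=10)
    parser.add_argument('-q', '--query', dest='query2')
    parser.add_argument('--url')
    parser.add_argument('--mode', choices=['text', 'info', 'raw'], default='text')
    parser.add_argument('--keep', action='store_true')
    return parser


# Passing a list to parse_args avoids touching sys.argv.
positional = build_parser().parse_args(['-n', '20', 'python tutorial'])
flagged = build_parser().parse_args(['-q', 'bm25'])
url_mode = build_parser().parse_args(['--url', 'https://example.com', '--mode', 'raw', '--keep'])
```

Because `--url` is checked first in `main()`, a query supplied alongside `--url` is parsed but never used.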
@@ -0,0 +1,17 @@
+ """Page-fetching package."""
+
+ from .core import fetch_page_text, is_dynamic_page, extract_text
+ from .session import fetch_page_session, fetch_all_pages
+ from .parser import parse_html, parse_info, parse_raw, parse_text
+
+ __all__ = [
+     'fetch_page_text',
+     'is_dynamic_page',
+     'extract_text',
+     'fetch_page_session',
+     'fetch_all_pages',
+     'parse_html',
+     'parse_info',
+     'parse_raw',
+     'parse_text',
+ ]
@@ -0,0 +1,77 @@
+ """Core page-fetching module."""
+
+ import time
+ import os
+ from DrissionPage import SessionPage, ChromiumPage, ChromiumOptions
+ from bs4 import BeautifulSoup
+ from readability.readability import Document
+
+ DEBUG = os.getenv('SKYSEARCH_DEBUG', 'false').lower() in ('1', 'true', 'yes')
+
+
+ def is_dynamic_page(html: str) -> bool:
+     """Decide whether the page needs browser rendering."""
+     dynamic_markers = ['id="root"', 'id="app"', 'id="__next"', '__NEXT_DATA__']
+
+     for marker in dynamic_markers:
+         if marker in html:
+             return True
+
+     soup = BeautifulSoup(html, 'lxml')
+     text = soup.get_text()
+     if len(text) < 500:
+         return True
+
+     return False
+
+
+ def extract_text(html: str) -> str:
+     """Extract the main body text with readability-lxml."""
+     if not html or len(html.strip()) == 0:
+         return ''
+
+     try:
+         doc = Document(html)
+         summary_html = doc.summary()
+         soup = BeautifulSoup(summary_html, 'lxml')
+         text = soup.get_text(separator=' ', strip=True)
+         return text if len(text) >= 50 else ''
+     except Exception:
+         return ''
+
+
+ def fetch_page_text(url: str, close_browser: bool = True) -> str:
+     """Fetch a single page and extract its main body text."""
+     browser = None
+     try:
+         page = SessionPage()
+         page.get(url, retry=2, timeout=10)
+         html = page.html
+
+         if is_dynamic_page(html):
+             if DEBUG:
+                 print(f" [DEBUG] Dynamic page, using browser | original length: {len(html)}")
+             try:
+                 options = ChromiumOptions()
+                 options.headless = True
+                 browser = ChromiumPage(options)
+                 browser.get(url)
+                 browser.wait.load_start()
+                 time.sleep(2)
+                 html = browser.html
+             except Exception as e:
+                 if DEBUG:
+                     print(f" [DEBUG] Browser failed, using original: {e}")
+                 if browser:
+                     browser.quit()
+                     browser = None
+
+         return extract_text(html)
+
+     except Exception as e:
+         if DEBUG:
+             print(f" [DEBUG] Fetch failed: {e}")
+         return ''
+     finally:
+         if close_browser and browser:
+             browser.quit()
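A note on the `is_dynamic_page` heuristic: it combines a marker check with a minimum text length, so any page with fewer than 500 characters of text is sent to the browser even when it is plain static HTML. The sketch below reproduces the same two rules with only the standard library, assuming `html.parser` is an acceptable stand-in for BeautifulSoup with lxml. The names `looks_dynamic` and `_TextCollector` are hypothetical.

```python
from html.parser import HTMLParser

# Same markers as the packaged heuristic.
DYNAMIC_MARKERS = ('id="root"', 'id="app"', 'id="__next"', '__NEXT_DATA__')


class _TextCollector(HTMLParser):
    """Collect every text node, including script and style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def looks_dynamic(html, min_text_length=500):
    """Return True when the page probably needs browser rendering."""
    if any(marker in html for marker in DYNAMIC_MARKERS):
        return True
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Short pages are treated as dynamic, even if they are static.
    return len(''.join(collector.chunks)) < min_text_length


spa_page = '<html><body><div id="root"></div>' + 'x' * 1000 + '</body></html>'
short_page = '<html><body><p>short</p></body></html>'
long_page = '<html><body><p>' + 'word ' * 200 + '</p></body></html>'
```

The short static page is the case worth keeping in mind: it triggers a browser launch that a plain HTTP fetch would not need.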
@@ -0,0 +1,64 @@
+ """HTML parsing module."""
+
+ from bs4 import BeautifulSoup
+ from readability.readability import Document
+
+
+ def parse_info(html):
+     """Parse structured information: title, text, links, meta."""
+     soup = BeautifulSoup(html, 'lxml')
+
+     # Extract the title
+     title = soup.title.string if soup.title else ""
+
+     # Extract meta tags
+     meta = {}
+     for tag in soup.find_all('meta'):
+         name = tag.get('property') or tag.get('name')
+         content = tag.get('content', '')
+         if name and content:
+             meta[name] = content
+
+     # Extract links
+     links = []
+     for a in soup.find_all('a', href=True):
+         links.append({
+             "text": a.get_text(strip=True),
+             "href": a['href']
+         })
+
+     # Extract the main body text
+     doc = Document(html)
+     summary_html = doc.summary()
+     text_soup = BeautifulSoup(summary_html, 'lxml')
+     text = text_soup.get_text(separator=' ', strip=True)
+
+     return {
+         "title": title,
+         "text": text,
+         "links": links,
+         "meta": meta
+     }
+
+
+ def parse_raw(html):
+     """Return the raw HTML."""
+     return {"html": html}
+
+
+ def parse_text(html):
+     """Extract plain text."""
+     doc = Document(html)
+     summary_html = doc.summary()
+     soup = BeautifulSoup(summary_html, 'lxml')
+     return soup.get_text(separator=' ', strip=True)
+
+
+ def parse_html(html, mode='text'):
+     """Parse HTML according to the mode."""
+     if mode == 'info':
+         return parse_info(html)
+     elif mode == 'raw':
+         return parse_raw(html)
+     else:
+         return parse_text(html)
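A note on what `parse_info` collects: `meta` is keyed by `property` when present and by `name` otherwise, and each link keeps its stripped text plus the raw `href`. The sketch below shows the same extraction rules with the standard-library `html.parser`, assuming that is close enough to BeautifulSoup for well-formed input. The class name `InfoCollector` is hypothetical, and readability-based body extraction is left out.

```python
from html.parser import HTMLParser


class InfoCollector(HTMLParser):
    """Collect meta tags and links while parsing."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta':
            # Prefer `property` (Open Graph) over `name`, as parse_info does.
            name = attrs.get('property') or attrs.get('name')
            content = attrs.get('content') or ''
            if name and content:
                self.meta[name] = content
        elif tag == 'a' and attrs.get('href') is not None:
            self._href = attrs['href']
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append({'text': ''.join(self._text).strip(), 'href': self._href})
            self._href = None


sample = (
    '<html><head><meta name="description" content="Demo page">'
    '<meta property="og:title" content="Demo"></head>'
    '<body><a href="/a"> First </a><a href="https://example.com/b">Second</a></body></html>'
)
collector = InfoCollector()
collector.feed(sample)
```

Relative hrefs such as `/a` come back unresolved here, and `parse_info` above likewise returns `a['href']` as written in the page.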
@@ -0,0 +1,95 @@
+ """Page-fetching session module."""
+
+ import time
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from DrissionPage import SessionPage, ChromiumPage, ChromiumOptions
+ from .core import extract_text, is_dynamic_page
+
+
+ def fetch_page_session(url: str):
+     """Fetch a page with SessionPage."""
+     try:
+         page = SessionPage()
+         page.get(url, retry=2, timeout=10)
+         html = page.html
+         text = extract_text(html)
+         return text, is_dynamic_page(html)
+     except Exception:
+         return '', True
+
+
+ def fetch_all_pages(urls, verbose=False):
+     """Two-round fetch: SessionPage in parallel, then a Chromium multi-tab retry.
+
+     Args:
+         urls: list of URLs
+         verbose: whether to print progress details (default False)
+     """
+     if verbose:
+         print("[1/2] Fetching pages with SessionPage...")
+     results = [None] * len(urls)
+     failed_indices = []
+
+     with ThreadPoolExecutor(max_workers=5) as executor:
+         futures = {executor.submit(fetch_page_session, url): i for i, url in enumerate(urls)}
+
+         for future in as_completed(futures):
+             i = futures[future]
+             text, maybe_dynamic = future.result()
+             results[i] = (i, text, maybe_dynamic)
+
+             if verbose:
+                 if text:
+                     print(f" OK [{i+1}/{len(urls)}] length: {len(text)}")
+                 else:
+                     failed_indices.append(i)
+                     print(f" RETRY [{i+1}/{len(urls)}] will retry with browser")
+             elif not text:
+                 failed_indices.append(i)
+
+     # Second round: retry the failed pages in Chromium tabs
+     if failed_indices:
+         if verbose:
+             print(f"\n[2/2] Retrying {len(failed_indices)} pages with Chromium (multi-tab)...")
+
+         retry_urls = [urls[i] for i in failed_indices]
+
+         try:
+             options = ChromiumOptions()
+             options.headless = True
+             browser = ChromiumPage(options)
+
+             try:
+                 browser.get(retry_urls[0])
+                 browser.wait.load_start()
+                 time.sleep(2)
+
+                 for url in retry_urls[1:]:
+                     browser.new_tab(url)
+
+                 time.sleep(3)
+
+                 tab_ids = browser.tab_ids
+
+                 for tab_idx, tab_id in enumerate(tab_ids):
+                     if tab_idx < len(retry_urls):
+                         try:
+                             browser.activate_tab(tab_id)
+                             time.sleep(1)
+                             html = browser.html
+                             text = extract_text(html)
+                             original_idx = failed_indices[tab_idx]
+                             results[original_idx] = (original_idx, text, True)
+                             if verbose:
+                                 print(f" OK [retry {tab_idx+1}/{len(retry_urls)}] length: {len(text)}")
+                         except Exception as e:
+                             if verbose:
+                                 print(f" FAIL [retry {tab_idx+1}/{len(retry_urls)}] error: {e}")
+             finally:
+                 browser.quit()
+
+         except Exception as e:
+             if verbose:
+                 print(f"Chromium retry failed: {e}")
+
+     return results
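A note on the two-round pattern: the first round runs a cheap fetcher in a thread pool and remembers which indices came back empty, and the second round retries only those with a slower fetcher. The sketch below shows that control flow with injected callables so it runs without DrissionPage. The names `fetch_two_rounds`, `fast_fetch`, and `slow_fetch` are hypothetical, and the third tuple element here simply records whether the second round was used, which is a simplification of the `is_dynamic` flag above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_two_rounds(urls, fast_fetch, slow_fetch, max_workers=5):
    """Fetch every URL with fast_fetch, then retry empty results with slow_fetch."""
    results = [None] * len(urls)
    failed = []

    # Round 1: run the fast fetcher in parallel, keeping each URL's index.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fast_fetch, url): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            i = futures[future]
            text = future.result()
            results[i] = (i, text, False)
            if not text:
                failed.append(i)

    # Round 2: retry the failures one by one, writing back by original index.
    for i in sorted(failed):
        results[i] = (i, slow_fetch(urls[i]), True)

    return results


# Stand-in fetchers: "b" fails the fast path and succeeds on retry.
pages = {"a": "alpha", "b": "", "c": "gamma"}
out = fetch_two_rounds(["a", "b", "c"], lambda u: pages[u], lambda u: "rendered " + u)
```

Writing results back by original index is what lets the caller sort by position afterwards, regardless of the order in which the futures complete.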
@@ -0,0 +1,29 @@
+ """BM25 ranking module."""
+
+ import jieba
+ from rank_bm25 import BM25Okapi
+ from typing import List
+
+
+ def rank_documents(documents: List[str], query: str) -> List[float]:
+     """
+     Score documents for relevance with the BM25 algorithm.
+
+     Args:
+         documents: list of document texts.
+         query: search query.
+
+     Returns:
+         A list of scores in the same order as the input documents.
+     """
+     if not documents:
+         return []
+
+     # Segment Chinese text into words with jieba
+     tokenized_corpus = [list(jieba.cut(doc)) for doc in documents]
+     bm25 = BM25Okapi(tokenized_corpus)
+
+     tokenized_query = list(jieba.cut(query))
+     scores = bm25.get_scores(tokenized_query)
+
+     return scores.tolist()
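A note on the scoring step: `rank_documents` delegates to `BM25Okapi`, so the formula itself never appears in the package. The sketch below is one way to write BM25 by hand, assuming whitespace tokenization instead of jieba, the commonly used parameters `k1=1.5` and `b=0.75`, and a smoothed IDF that never goes negative (a simplification that may differ from what rank-bm25 computes). The name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(documents, query, k1=1.5, b=0.75):
    """Return one BM25 score per document, in input order."""
    corpus = [doc.lower().split() for doc in documents]
    if not corpus:
        return []

    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in corpus:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (assumed variant): always positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents are penalized via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python web search tutorial",
    "cooking recipes for pasta",
    "search engines rank web pages",
]
scores = bm25_scores(docs, "web search")
```

Like `rank_documents`, the sketch returns scores in input order and leaves sorting to the caller.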
@@ -0,0 +1,67 @@
+ """Bing search module."""
+
+ from typing import List, Dict
+ from DrissionPage import SessionPage
+ from bs4 import BeautifulSoup
+ from urllib.parse import quote, urlparse, parse_qs
+
+
+ def bing_search(query: str, num_results: int = 10) -> List[Dict[str, str]]:
+     """
+     Get search results from Bing.
+
+     Args:
+         query: search query.
+         num_results: number of results to return.
+
+     Returns:
+         A list of dicts, each containing title, url, and snippet.
+     """
+     url = f"https://www.bing.com/search?q={quote(query)}"
+     page = SessionPage()
+     page.get(url, retry=2)
+
+     soup = BeautifulSoup(page.html, 'lxml')
+     results = []
+
+     # Bing results live in <li class="b_algo"> elements
+     for li in soup.select('li.b_algo')[:num_results]:
+         a_tag = li.select_one('h2 a')
+         p_tag = li.select_one('p')
+
+         if not a_tag:
+             continue
+
+         title = a_tag.text.strip()
+         href = a_tag.get('href', '')
+         url = parse_real_url(href)
+         snippet = p_tag.text.strip() if p_tag else ''
+
+         results.append({
+             'title': title,
+             'url': url,
+             'snippet': snippet,
+         })
+
+     return results
+
+
+ def parse_real_url(redirect_url: str) -> str:
+     """
+     Extract the real URL from a Bing redirect URL.
+
+     Args:
+         redirect_url: Bing redirect URL, in the form /url?q=<real URL>&...
+
+     Returns:
+         The real URL as a string.
+     """
+     if not redirect_url:
+         return ''
+
+     if redirect_url.startswith('/url?'):
+         parsed = urlparse(redirect_url)
+         qs = parse_qs(parsed.query)
+         return qs.get('q', [''])[0]
+
+     return redirect_url
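A note on redirect handling: `parse_real_url` only unwraps hrefs that start with `/url?` and passes everything else through unchanged. If Bing returns click-tracking links instead, a second branch would be needed. The sketch below assumes such links look like `/ck/a?...&u=a1<base64url of the target>`; that format is an assumption made here, not something the package documents, and `resolve_redirect` is a hypothetical name.

```python
import base64
from urllib.parse import urlparse, parse_qs


def resolve_redirect(href):
    """Unwrap a redirect href, falling back to the href itself."""
    if not href:
        return ''

    parsed = urlparse(href)
    params = parse_qs(parsed.query)

    # Case handled by the packaged parse_real_url: /url?q=<target>
    if href.startswith('/url?'):
        return params.get('q', [''])[0]

    # ASSUMED click-tracking form: /ck/a?...&u=a1<base64url(target)>
    if parsed.path.startswith('/ck/a') and 'u' in params:
        token = params['u'][0]
        if token.startswith('a1'):
            payload = token[2:]
            payload += '=' * (-len(payload) % 4)  # restore stripped padding
            try:
                return base64.urlsafe_b64decode(payload).decode('utf-8')
            except (ValueError, UnicodeDecodeError):
                return href  # not decodable: keep the original link

    return href


# Build a tracking link by hand so the round trip can be checked offline.
target = "https://example.com/page?x=1"
token = base64.urlsafe_b64encode(target.encode()).decode().rstrip('=')
tracking_link = "https://www.bing.com/ck/a?!&&p=abc&u=a1" + token + "&ntb=1"
```

Falling back to the original href on any decoding problem keeps the function no worse than the pass-through behavior of `parse_real_url`.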
@@ -0,0 +1,17 @@
+ README.md
+ pyproject.toml
+ src/skysearch/__init__.py
+ src/skysearch/api.py
+ src/skysearch/cli.py
+ src/skysearch/ranker.py
+ src/skysearch/search.py
+ src/skysearch.egg-info/PKG-INFO
+ src/skysearch.egg-info/SOURCES.txt
+ src/skysearch.egg-info/dependency_links.txt
+ src/skysearch.egg-info/entry_points.txt
+ src/skysearch.egg-info/requires.txt
+ src/skysearch.egg-info/top_level.txt
+ src/skysearch/fetcher/__init__.py
+ src/skysearch/fetcher/core.py
+ src/skysearch/fetcher/parser.py
+ src/skysearch/fetcher/session.py
@@ -0,0 +1,2 @@
+ [console_scripts]
+ skysearch = skysearch.cli:main
@@ -0,0 +1,6 @@
+ DrissionPage
+ beautifulsoup4
+ lxml
+ readability-lxml
+ jieba
+ rank-bm25
@@ -0,0 +1 @@
+ skysearch