vibespider 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 MXFP4
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,249 @@
+ Metadata-Version: 2.4
+ Name: vibespider
+ Version: 0.1.0
+ Summary: The Optical-Based Browser Agent
+ Author: vibespider contributors
+ License: MIT License
+
+ Copyright (c) 2026 MXFP4
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ Project-URL: Homepage, https://github.com/mx-fp4/vibespider
+ Project-URL: Repository, https://github.com/mx-fp4/vibespider
+ Keywords: browser,agent,playwright,vision,automation
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Software Development :: Libraries
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: playwright~=1.58.0
+ Dynamic: license-file
+
@@ -0,0 +1,205 @@
+ # vibespider
+
+ > The Optical-Based Browser Agent
+ > A purely vision-driven browser agent
+
+ **Core philosophy: I don't read code, I just look.**
+
+ vibespider's goal is not to "parse page source" but to work the way a person does: look at the page, make a judgment, then act.
+
+ It drives a real browser through Playwright, hands page screenshots to a multimodal LLM (such as Qwen-VL or GPT-4o) for understanding, and then performs human-like actions: clicking, typing, scrolling, and switching pages.
+
+ This approach is more robust against front-end obfuscation, dynamic rendering, complex SPAs, and anti-scraping scripts:
+ - It does not depend on brittle CSS/XPath rules
+ - It does not depend on the quality of the page's semantic markup
+ - It is not broken by frequent DOM restructuring
+
+ ---
+
+ ## Why vision-driven
+
+ The usual path for traditional scraping/automation:
+ - Read the HTML
+ - Write selectors
+ - Bind rules
+
+ The problems:
+ - One site redesign and the rules break
+ - Dynamically rendered JS content is hard to capture reliably
+ - Obfuscation and anti-automation measures send maintenance costs soaring
+
+ vibespider's path:
+ 1. Open and control a real browser with Playwright
+ 2. Screenshot the current page (or a region of it)
+ 3. Send the screenshot and the task goal to a multimodal model
+ 4. **The agent combines the model's output into the next "action plan"** (which region to click, what to type, whether to paginate)
+ 5. **The agent dispatches the action and monitors the feedback**, then enters the next observe-decide-act cycle
+
+ In essence, this turns "rule-driven" automation into "cognition-driven" automation.
+
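The five-step loop above can be sketched as plain code. This is a hedged illustration only, not the shipped vibespider API: `capture`, `decide`, and `execute` are hypothetical stand-ins for the screenshot layer, the model/agent layer, and the Playwright executor.

```python
# Minimal sketch of the observe -> think -> act loop described above.
# All three callables are hypothetical placeholders, not vibespider API.

def run_loop(capture, decide, execute, goal, max_steps=10):
    """Repeat: screenshot -> ask the model/agent -> perform the action."""
    history = []
    for _ in range(max_steps):
        screenshot = capture()                       # step 2: grab the current view
        action = decide(screenshot, goal, history)   # steps 3-4: model + agent plan
        if action["type"] == "done":                 # the agent judges the goal met
            return {"status": "ok", "history": history}
        execute(action)                              # step 5: dispatch to the browser
        history.append(action)
    return {"status": "max_steps_reached", "history": history}
```

The cap on `max_steps` matters in practice: a vision loop with no budget can cycle forever on a page it misreads.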
+ ---
+
+ ## How the agent decides and orchestrates
+
+ In vibespider, the multimodal model is responsible for "understanding the picture", while the **agent is the control center**:
+
+ - **Goal decomposition**: split the user's task into executable stages (reach the page, locate the target area, trigger the action, verify the result)
+ - **Action decision**: combine screenshot semantics, action history, and current state to choose the best next step
+ - **Execution scheduling**: uniformly dispatch Playwright's atomic actions (click/type/scroll/wait/navigate/retry)
+ - **Feedback loop**: re-screenshot after each action and judge whether the sub-goal was met; on failure, retry, switch strategy, or roll back
+ - **Safety constraints**: gate high-risk actions (submit, pay, delete) behind confirmation policies and protection rules
+
+ Think of it as:
+ - Model = perception and understanding
+ - Agent = decision and orchestration
+ - Playwright = executor
+
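The "execution scheduling" and "safety constraints" bullets can be made concrete with a small dispatcher. This is a sketch under stated assumptions: the action names mirror the list above, `page` is any object exposing those methods (a Playwright `Page` adapter would be one choice), and the risky-action set and `confirm` policy are illustrative, not part of vibespider.

```python
# Hedged sketch: an agent dispatching atomic actions to an executor object.
RISKY = {"submit", "pay", "delete"}  # gated by a confirmation policy

def dispatch(page, action, confirm=lambda a: False):
    """Route one action dict to the matching executor call."""
    kind = action["type"]
    if kind in RISKY and not confirm(action):
        return {"status": "blocked", "action": kind}  # safety constraint
    handlers = {
        "click":    lambda: page.click(action["selector"]),
        "type":     lambda: page.type(action["selector"], action["text"]),
        "scroll":   lambda: page.scroll(action["dy"]),
        "navigate": lambda: page.goto(action["url"]),
        "wait":     lambda: page.wait(action["ms"]),
    }
    if kind not in handlers:
        return {"status": "unknown_action", "action": kind}
    handlers[kind]()
    return {"status": "ok", "action": kind}
```

Returning a status dict instead of raising keeps the feedback loop in charge: the agent can inspect `blocked` or `unknown_action` and retry with a different plan.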
+ ---
+
+ ## Capability boundaries (design principles)
+
+ - **Act only on what is visible**: decisions are based solely on on-screen information
+ - **Human-like paths**: prefer simulating real user interaction to reduce mechanical behavior signatures
+ - **Task-oriented**: aim to "complete the goal", not to "scrape the whole site's DOM"
+ - **Explainable action chains**: every step can emit a screenshot + reasoning + executed-action record
+
+ ---
+
+ ## Tech stack
+
+ - Python 3.10+
+ - Playwright (browser control and screenshots)
+ - Multimodal LLM (visual understanding and action decisions)
+   - Qwen-VL (or any compatible vision-inference endpoint)
+   - GPT-4o (with image input)
+
+ ---
+
+ ## Installation
+
+ ### Option 1: install from PyPI (recommended)
+
+ ```bash
+ pip install vibespider
+ playwright install
+ ```
+
+ ### Option 2: install from source (development or local debugging)
+
+ ```bash
+ cd /path/to/vibespider
+ pip install -e .
+ pip install -r requirements.txt
+ playwright install
+ ```
+
+ ---
+
+ ## Example usage
+
+ > The example below shows the typical agent invocation: give it a goal, and let the agent decide and orchestrate the actions itself.
+
+ ```python
+ from vibespider import SpiderAgent
+
+ agent = SpiderAgent(
+     model_provider="qwen",        # or "openai"
+     model_name="qwen-vl-max",     # or "gpt-4o"
+     api_key="YOUR_API_KEY",
+     headless=False,
+ )
+
+ result = agent.run(
+     task="Open an e-commerce search page, search for wireless mice, and extract the titles and prices of the first 10 results",
+     start_url="https://example.com",
+     output_schema={
+         "items": [{"title": "str", "price": "str"}]
+     },
+ )
+
+ print(result)
+ ```
+
+ > The current `0.1.x` release is a minimal usable skeleton: it can launch the browser, visit a page, take a full-page screenshot, and return a result payload. The full multi-round decision loop (Observe → Think → Act) will be completed in later versions.
+
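Concretely, in this `0.1.x` skeleton `run()` returns observation metadata rather than extracted items. The key set below mirrors the dict built in `vibespider/agent.py` in this sdist; the helper around it is just an illustrative way to check the payload shape.

```python
# Keys of the payload returned by the 0.1.x skeleton, mirroring the dict
# constructed in vibespider/agent.py in this package.
SKELETON_KEYS = {
    "status", "task", "model", "start_url", "current_url",
    "page_title", "screenshot", "output_schema", "note",
}

def is_skeleton_result(result: dict) -> bool:
    """True if `result` looks like the observe-only payload of 0.1.x."""
    return SKELETON_KEYS <= result.keys() and result.get("status") == "ok"
```

Callers written against this shape should expect `screenshot` to be a file path and `output_schema` to be echoed back unfilled until the decision loop lands.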
+ ---
+
+ ## Local development setup (current skeleton)
+
+ The repository is at an early stage; initialize an experimental environment like this:
+
+ ```bash
+ python -m venv venv
+ source venv/bin/activate
+ pip install -r requirements.txt
+ playwright install
+ ```
+
+ The typical invocation flow will be:
+ 1. Configure the model provider (Qwen-VL / GPT-4o) and API key
+ 2. Specify the task goal (e.g. "open the search page and extract the first 10 result titles")
+ 3. Start the vision loop (screenshot → reason → act)
+ 4. Output structured results and a process log
+
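Step 1 of that flow is usually best done from the environment rather than hard-coded keys. A minimal sketch, assuming illustrative environment-variable names (vibespider does not define these):

```python
# Hedged sketch of provider/key configuration from the environment.
# DASHSCOPE_API_KEY / OPENAI_API_KEY are conventional names, not vibespider API.
import os

def resolve_model(provider: str = "qwen") -> dict:
    """Map a provider name to a default model and its API-key env variable."""
    table = {
        "qwen":   ("qwen-vl-max", "DASHSCOPE_API_KEY"),
        "openai": ("gpt-4o", "OPENAI_API_KEY"),
    }
    name, env_var = table[provider]
    return {"provider": provider, "model": name, "api_key": os.environ.get(env_var)}
```

The resulting dict maps directly onto the `model_provider` / `model_name` / `api_key` arguments shown in the example above.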
+ ---
+
+ ## Packaging and publishing (PyPI)
+
+ ```bash
+ python -m pip install --upgrade build twine
+ python -m build
+ python -m twine upload dist/*
+ ```
+
+ Before publishing, check the package contents:
+
+ ```bash
+ python -m twine check dist/*
+ ```
+
+ ---
+
+ ## Use cases
+
+ - Data collection from highly dynamic pages
+ - Stable automation of sites that redesign frequently
+ - Complex interaction flows (post-login, multi-step navigation, popups/overlays)
+ - Task orchestration that needs to "browse like a human"
+
+ ---
+
+ ## Comparison with traditional approaches
+
+ | Dimension | Traditional rule-based scraper | vibespider (vision agent) |
+ |---|---|---|
+ | Depends on | DOM/selectors | On-screen visuals + multimodal reasoning |
+ | Resilience to redesigns | Weak | Strong |
+ | Against obfuscation | Breaks easily | More robust |
+ | Maintenance | Rewrite rules | Tune task strategy and prompts |
+
+ ---
+
+ ## Roadmap
+
+ - [ ] Core agent loop (Observe → Think → Act)
+ - [ ] Unified model adapter layer for Qwen-VL / GPT-4o
+ - [ ] Action DSL (click, type, scroll, wait, retry)
+ - [ ] Screenshot and decision-log replay
+ - [ ] Failure recovery and self-correction strategies
+ - [ ] Structured extractor (JSON Schema output)
+
+ ---
+
+ ## Caveats
+
+ - Follow the target site's terms of service, its robots policy, and applicable laws and regulations.
+ - This project is intended for automation research and engineering practice; abuse is discouraged.
+ - When accounts, payments, or private data are involved, enable least-privilege access and audit logging.
+
+ ---
+
+ ## Project vision
+
+ Evolve web automation from "writing selectors" to "looking at the screen and deciding",
+ and bring browser agents into the era of visual intelligence.
+
@@ -0,0 +1,31 @@
+ [build-system]
+ requires = ["setuptools>=68", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "vibespider"
+ version = "0.1.0"
+ description = "The Optical-Based Browser Agent"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = { file = "LICENSE" }
+ authors = [{ name = "vibespider contributors" }]
+ dependencies = ["playwright~=1.58.0"]
+ keywords = ["browser", "agent", "playwright", "vision", "automation"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "License :: OSI Approved :: MIT License",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.10",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+     "Topic :: Software Development :: Libraries",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/mx-fp4/vibespider"
+ Repository = "https://github.com/mx-fp4/vibespider"
+
+ [tool.setuptools.packages.find]
+ include = ["vibespider*"]
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
@@ -0,0 +1,4 @@
+ from .agent import SpiderAgent
+
+ __all__ = ["SpiderAgent"]
+ __version__ = "0.1.0"
@@ -0,0 +1,54 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Any
+
+ from playwright.sync_api import sync_playwright
+
+
+ @dataclass
+ class SpiderAgent:
+     model_provider: str = "openai"
+     model_name: str = "gpt-4o"
+     api_key: str | None = None
+     headless: bool = True
+     viewport_width: int = 1440
+     viewport_height: int = 900
+
+     def run(
+         self,
+         task: str,
+         start_url: str,
+         output_schema: dict[str, Any] | None = None,
+         screenshot_path: str = "artifacts/last.png",
+     ) -> dict[str, Any]:
+         # Ensure the artifacts directory exists before capturing.
+         output_file = Path(screenshot_path)
+         output_file.parent.mkdir(parents=True, exist_ok=True)
+
+         with sync_playwright() as playwright:
+             browser = playwright.chromium.launch(headless=self.headless)
+             context = browser.new_context(
+                 viewport={"width": self.viewport_width, "height": self.viewport_height}
+             )
+             page = context.new_page()
+             page.goto(start_url, wait_until="domcontentloaded")
+             page.screenshot(path=str(output_file), full_page=True)
+             # Capture page state before the browser is closed.
+             title = page.title()
+             current_url = page.url
+             browser.close()
+
+         return {
+             "status": "ok",
+             "task": task,
+             "model": {
+                 "provider": self.model_provider,
+                 "name": self.model_name,
+             },
+             "start_url": start_url,
+             "current_url": current_url,
+             "page_title": title,
+             "screenshot": str(output_file),
+             "output_schema": output_schema,
+             "note": "MVP skeleton: capture + observe only, decision loop will come in next versions.",
+         }
@@ -0,0 +1,10 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ vibespider/__init__.py
+ vibespider/agent.py
+ vibespider.egg-info/PKG-INFO
+ vibespider.egg-info/SOURCES.txt
+ vibespider.egg-info/dependency_links.txt
+ vibespider.egg-info/requires.txt
+ vibespider.egg-info/top_level.txt
@@ -0,0 +1 @@
+ playwright~=1.58.0
@@ -0,0 +1 @@
+ vibespider