crawlo 1.1.5__py3-none-any.whl → 1.1.8__py3-none-any.whl

This diff shows the contents of publicly released package versions as published to a supported registry. It is provided for informational purposes only and reflects the changes between the two versions as they appear in that registry.

This release of crawlo has been flagged as potentially problematic.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: crawlo
- Version: 1.1.5
+ Version: 1.1.8
  Summary: Crawlo is a high-performance Python crawler framework based on asynchronous IO, with support for distributed crawling.
  Home-page: https://github.com/crawl-coder/Crawlo.git
  Author: crawl-coder
@@ -80,10 +80,25 @@ pip install crawlo
  ### Create a project

  ```bash
+ # Create a default project
  crawlo startproject myproject
+
+ # Create a project from the distributed template
+ crawlo startproject myproject distributed
+
+ # Create a project and select specific modules
+ crawlo startproject myproject --modules mysql,redis,proxy
+
  cd myproject
  ```

+ ### Generate a spider
+
+ ```bash
+ # Generate a spider inside the project directory
+ crawlo genspider news_spider news.example.com
+ ```
+
  ### Write a spider

  ```python
@@ -109,9 +124,158 @@ class MySpider(Spider):
  ### Run a spider

  ```bash
- crawlo crawl myspider
+ # Run a spider with the command-line tool (recommended)
+ crawlo run myspider
+
+ # Run via the run.py script bundled with the project
+ python run.py
+
+ # Run all spiders
+ crawlo run all
+
+ # Also works correctly from a project subdirectory
+ cd subdirectory
+ crawlo run myspider
+ ```
+
+ ## 📜 Command-Line Tools
+
+ Crawlo ships with a rich set of command-line tools for developing and managing crawler projects:
+
+ ### Getting help
+
+ ```bash
+ # Show help information
+ crawlo -h
+ crawlo --help
+ crawlo help
+ ```
+
+ ### crawlo startproject
+
+ Create a new crawler project.
+
+ ```bash
+ # Basic usage
+ crawlo startproject <project_name> [template_type] [--modules module1,module2]
+
+ # Examples
+ crawlo startproject my_spider_project
+ crawlo startproject news_crawler simple
+ crawlo startproject ecommerce_spider distributed --modules mysql,proxy
+ ```
+
+ **Arguments:**
+ - `project_name`: project name (must be a valid Python identifier)
+ - `template_type`: template type (optional)
+   - `default`: default template, general-purpose configuration suitable for most projects
+   - `simple`: simplified template, minimal configuration for a quick start
+   - `distributed`: distributed template, tuned for distributed crawling
+   - `high-performance`: high-performance template, tuned for large-scale, high-concurrency crawling
+   - `gentle`: gentle template, low-load configuration that is friendly to target sites
+ - `--modules`: module components to include (optional)
+   - `mysql`: MySQL database support
+   - `mongodb`: MongoDB database support
+   - `redis`: Redis support (distributed queue and deduplication)
+   - `proxy`: proxy support
+   - `monitoring`: monitoring and performance profiling
+   - `dedup`: deduplication
+   - `httpx`: HttpX downloader
+   - `aiohttp`: AioHttp downloader
+   - `curl`: CurlCffi downloader
+
+ ### crawlo genspider
+
+ Generate a new spider inside an existing project.
+
+ ```bash
+ # Basic usage
+ crawlo genspider <spider_name> <domain>
+
+ # Examples
+ crawlo genspider news_spider news.example.com
+ crawlo genspider product_spider shop.example.com
+ ```
+
+ **Arguments:**
+ - `spider_name`: spider name (must be a valid Python identifier)
+ - `domain`: target domain
+
+ ### crawlo run
+
+ Run spiders.
+
+ ```bash
+ # Basic usage
+ crawlo run <spider_name>|all [--json] [--no-stats]
+
+ # Examples
+ crawlo run myspider
+ crawlo run all
+ crawlo run all --json --no-stats
  ```

+ **Arguments:**
+ - `spider_name`: name of the spider to run
+ - `all`: run all spiders
+ - `--json`: output results in JSON format
+ - `--no-stats`: do not record statistics
+
+ ### crawlo list
+
+ List all spiders available in the project.
+
+ ```bash
+ # Basic usage
+ crawlo list [--json]
+
+ # Examples
+ crawlo list
+ crawlo list --json
+ ```
+
+ **Arguments:**
+ - `--json`: output results in JSON format
+
+ ### crawlo check
+
+ Check spider definitions for compliance.
+
+ ```bash
+ # Basic usage
+ crawlo check [--fix] [--ci] [--json] [--watch]
+
+ # Examples
+ crawlo check
+ crawlo check --fix
+ crawlo check --ci
+ crawlo check --watch
+ ```
+
+ **Arguments:**
+ - `--fix`: automatically fix common issues
+ - `--ci`: CI-mode output (concise format)
+ - `--json`: output results in JSON format
+ - `--watch`: watch mode, re-check automatically when files change
+
+ ### crawlo stats
+
+ View spider run statistics.
+
+ ```bash
+ # Basic usage
+ crawlo stats [spider_name] [--all]
+
+ # Examples
+ crawlo stats
+ crawlo stats myspider
+ crawlo stats myspider --all
+ ```
+
+ **Arguments:**
+ - `spider_name`: name of the spider whose statistics to view
+ - `--all`: show all historical runs of the specified spider
+
  ## 🏗️ Architecture

  ### Component interaction diagram
@@ -176,6 +340,7 @@ crawlo crawl myspider
  │ │ │ - ValidationPipeline │ │ │
  │ │ │ - ProcessingPipeline │ │ │
  │ │ │ - StoragePipeline │ │ │
+ │ │ │ - DeduplicationPipeline │ │ │
  │ │ └─────────────────────────┘ │ │
  │ └──────────────────────────────┘ │
  └─────────────────────────────────────┘
@@ -229,12 +394,13 @@ crawlo crawl myspider
         ▼                                   ▼
  ┌─────────────────┐ 7. Generate items ┌─────────────┐
  │ Processor       ├──────────────────►│ Pipeline    │
- └─────────────────┘                   └─────────────┘
-        │ 8. Store data
-
- ┌─────────────────┐
- │ Items │
- └─────────────────┘
+ └─────────────────┘                   └──────┬──────┘
+        │ 8. Store data                       │ 9. Deduplicate
+
+ ┌─────────────────┐               ┌─────────────────┐
+ │ Items │◄─────────────┤ Deduplication
+ └─────────────────┘               │ Pipeline │
+                                   └─────────────────┘
  ```

  ### Module hierarchy diagram
@@ -298,6 +464,8 @@ crawlo/
  │ ├── pipeline_manager.py # Pipeline manager
  │ ├── base_pipeline.py # Pipeline base class
  │ ├── console_pipeline.py # Console output pipeline
+ │ ├── json_pipeline.py # JSON storage pipeline
+ │ ├── redis_dedup_pipeline.py # Redis deduplication pipeline
  │ └── mysql_pipeline.py # MySQL storage pipeline

  ├── extension/ # Extension components
@@ -335,7 +503,7 @@ crawlo/
  - **QueueManager**: unified queue manager, switches automatically between the in-memory queue and the Redis queue
  - **Filter**: request deduplication filter, with both in-memory and Redis implementations
  - **Middleware**: middleware system for request/response pre- and post-processing
- - **Pipeline**: data processing pipeline, supporting multiple storage backends (console, databases, etc.)
+ - **Pipeline**: data processing pipeline, supporting multiple storage backends (console, databases, etc.) and deduplication
  - **Spider**: spider base class that defines the crawling logic

  ### Run modes
@@ -356,12 +524,64 @@ CONCURRENCY = 16
  DOWNLOAD_DELAY = 1.0
  QUEUE_TYPE = 'memory'  # standalone mode
  # QUEUE_TYPE = 'redis'  # distributed mode
+
+ # Redis configuration (used in distributed mode)
+ REDIS_HOST = 'localhost'
+ REDIS_PORT = 6379
+ REDIS_DB = 0
+ REDIS_PASSWORD = ''
+
+ # Data pipeline configuration
+ PIPELINES = [
+     'crawlo.pipelines.console_pipeline.ConsolePipeline',
+     'crawlo.pipelines.json_pipeline.JsonPipeline',
+     'crawlo.pipelines.redis_dedup_pipeline.RedisDedupPipeline',  # Redis deduplication pipeline
+     'crawlo.pipelines.mysql_pipeline.AsyncmyMySQLPipeline',      # MySQL storage pipeline
+ ]
  ```

- ### Command-line configuration
+ ### MySQL pipeline configuration

+ Crawlo ships with a ready-made MySQL pipeline that makes it easy to store crawled data in a MySQL database:
+
+ ```python
+ # Enable the MySQL pipeline in settings.py
+ PIPELINES = [
+     'crawlo.pipelines.mysql_pipeline.AsyncmyMySQLPipeline',
+ ]
+
+ # MySQL database configuration
+ MYSQL_HOST = 'localhost'
+ MYSQL_PORT = 3306
+ MYSQL_USER = 'your_username'
+ MYSQL_PASSWORD = 'your_password'
+ MYSQL_DB = 'your_database'
+ MYSQL_TABLE = 'your_table_name'
+
+ # Optional batch-insert configuration
+ MYSQL_BATCH_SIZE = 100
+ MYSQL_USE_BATCH = True
  ```
- crawlo crawl myspider --concurrency=32 --delay=0.5
+
+ MySQL pipeline features:
+ - **Asynchronous operation**: built on the asyncmy driver for high-performance async database access
+ - **Connection pooling**: database connections are managed automatically for better efficiency
+ - **Batch inserts**: supports batched inserts for higher throughput
+ - **Transaction support**: ensures data consistency
+ - **Flexible configuration**: table name, batch size and other parameters are configurable
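The batch-insert behaviour listed above boils down to a single `executemany` round trip per batch. The following is a generic sketch of that pattern with the asyncmy driver, not code taken from the package; the table, columns, credentials and exact connection keyword names are placeholder assumptions.

```python
import asyncio

import asyncmy  # the driver behind AsyncmyMySQLPipeline


async def insert_batch(rows):
    # Pool parameters mirror the MYSQL_* settings shown above (names are illustrative).
    pool = await asyncmy.create_pool(
        host="localhost", port=3306,
        user="your_username", password="your_password",
        database="your_database",
    )
    sql = "INSERT INTO your_table_name (title, url) VALUES (%s, %s)"
    async with pool.acquire() as conn:
        async with conn.cursor() as cursor:
            # One round trip for the whole batch instead of one INSERT per item.
            await cursor.executemany(sql, rows)
        await conn.commit()


asyncio.run(insert_batch([("Example title", "https://example.com/1")]))
```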
+
+ ### Command-line configuration
+
+ ```bash
+ # Run a single spider
+ crawlo run myspider
+
+ # Run all spiders
+ crawlo run all
+
+ # Also works correctly from a project subdirectory
+ cd subdirectory
+ crawlo run myspider
  ```

  ## 🧩 Core Components
@@ -370,7 +590,11 @@ crawlo crawl myspider --concurrency=32 --delay=0.5
  Flexible middleware system supporting request pre-processing, response processing and exception handling.

  ### Pipeline system
- Extensible data processing pipeline supporting multiple storage backends (console, databases, etc.).
+ Extensible data processing pipeline supporting multiple storage backends (console, databases, etc.) and deduplication:
+ - **ConsolePipeline**: console output pipeline
+ - **JsonPipeline**: JSON file storage pipeline
+ - **RedisDedupPipeline**: Redis deduplication pipeline, using Redis sets for distributed deduplication
+ - **AsyncmyMySQLPipeline**: MySQL storage pipeline built on the asyncmy driver
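Set-based deduplication of the kind the RedisDedupPipeline entry describes usually relies on `SADD` returning 0 for members that already exist. The snippet below is a generic illustration of that Redis pattern using redis-py, not the package's actual implementation; the key name and fingerprinting scheme are placeholders.

```python
import hashlib

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, db=0)


def seen_before(item: dict, key: str = "crawlo:myproject:item_fingerprints") -> bool:
    """Return True if an equivalent item was already recorded in the shared Redis set."""
    # Build a deterministic fingerprint of the item's fields.
    fingerprint = hashlib.sha256(repr(sorted(item.items())).encode("utf-8")).hexdigest()
    # SADD returns 1 when the member is new and 0 when it already exists,
    # which is what makes the check work across multiple crawler nodes.
    return r.sadd(key, fingerprint) == 0


print(seen_before({"url": "https://example.com/1"}))  # False the first time
print(seen_before({"url": "https://example.com/1"}))  # True afterwards
```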

  ### Extension components
  Feature-enhancing extensions, including logging, monitoring and performance profiling.
@@ -382,6 +606,7 @@ crawlo crawl myspider --concurrency=32 --delay=0.5

  - [API data collection](examples/api_data_collection/) - a simple API data collection example
  - [Telecom equipment licenses](examples/telecom_licenses_distributed/) - a distributed crawling example
+ - [OFweek distributed crawler](examples/ofweek_distributed/) - a more complex distributed crawler example with Redis deduplication

  ## 📚 Documentation

@@ -1,13 +1,13 @@
  crawlo/__init__.py,sha256=jSOsZbDJ_Q5wZV8onSXx5LgNM7Z1q3zCROGdImBDr2I,1373
- crawlo/__version__.py,sha256=neh7i8wZ1x6FcsvBBU2qQY3vJro_j6McS8uQFuJdY2M,23
- crawlo/cli.py,sha256=hjAJKx9pba375sATvvcy-dtZyBIgXj8fRBq9RFIZHA4,1206
- crawlo/config.py,sha256=JYz4xL2Av5t41Bw90CVS4SSYg18-MXIxwXfKu0WuBjI,9690
+ crawlo/__version__.py,sha256=jro_fYFaWTpMLcXlRQe53nd8yxygCDFL1sK3eYFcmKI,23
+ crawlo/cli.py,sha256=ZrrOKAvqgGJgqoyakzItt-Jroa9JBF9vQG-1lz4wBPM,2094
+ crawlo/config.py,sha256=_pORwcVEgjEhrqVaApu51X0Il3TBK6w9aezGnnqYu8Y,9847
  crawlo/config_validator.py,sha256=M118EATR-tITzRSe2oSinV5oh2QsooMCkEJ5WS8ma_0,10155
  crawlo/crawler.py,sha256=24EE7zFPByeYLJnf1K_R9fhJMqaFUjBSa6TuUhlY4TI,37398
  crawlo/event.py,sha256=ZhoPW5CglCEuZNFEwviSCBIw0pT5O6jT98bqYrDFd3E,324
  crawlo/exceptions.py,sha256=YVIDnC1bKSMv3fXH_6tinWMuD9HmKHIaUfO4_fkX5sY,1247
- crawlo/mode_manager.py,sha256=9AsDGhigYqohqiE3iYscISyRSoANlTdy4WRzXFAMPiI,7283
- crawlo/project.py,sha256=DOf_zzdA_A_nilff6Dp5KJXA6KphHYMalAYv336-cO8,5335
+ crawlo/mode_manager.py,sha256=q34nX-bMCLCKFfeYy_34VrBT86sL_K030LlLX2H3m0I,7728
+ crawlo/project.py,sha256=MH2OYXIZdc9EJolUpUBXTcAgS7qvPP366LkOy62iP_8,6831
  crawlo/stats_collector.py,sha256=v4jC9BAe-23w93hWzbeMCCgQ9VuFPyxw5JV9ItbGH8w,1636
  crawlo/subscriber.py,sha256=Aj0kPpbBYlzOb1uViDFraoaThsQEVlqOSYUaFT3jSDs,5136
  crawlo/task_manager.py,sha256=PScfEB03306Txa0l38AeQ_0WVhKzeWOFyT3bnrkbHW0,849
@@ -15,16 +15,17 @@ crawlo/cleaners/__init__.py,sha256=lxL-ZWDKW-DdobdgKUQ27wNmBiUhGnD0CVG6HWkX3_o,1
  crawlo/cleaners/data_formatter.py,sha256=iBDHpZBZvn9O7pLkTQilE1TzYJQEc3z3f6HXoVus0f0,7808
  crawlo/cleaners/encoding_converter.py,sha256=G3khLlk0uBeTwIutsWxVUeSuyc1GMC1BDNJDwsU9ryg,4238
  crawlo/cleaners/text_cleaner.py,sha256=16e6WqIIb9qANMiK-vCEl4TvgkId19Aa2W1NMLU-jFQ,6707
- crawlo/commands/__init__.py,sha256=kZ3qATqDPmMUCNUQSFfBfIA8fp_1dgBwIAWbmFN3_To,355
+ crawlo/commands/__init__.py,sha256=orvY6wLOBwGUEJKeF3h_T1fxj8AaQLjngBDd-3xKOE4,392
  crawlo/commands/check.py,sha256=jW8SgfkOS35j4VS7nRZBZdFCBX9CVFez5LR2sfP_H1U,23437
  crawlo/commands/genspider.py,sha256=_3GwFMYK79BuKk__5L0ljuwWwOzN80MeuhRkL4Ql11A,5201
+ crawlo/commands/help.py,sha256=jJ8GbFJcJVQytPIYsEAMT6v58fNo62qd-G3G3elB-1Q,5011
  crawlo/commands/list.py,sha256=octTk0QZhapiyM7WgCPersP2v3MesthbJeG9vMqVFOs,5936
- crawlo/commands/run.py,sha256=m7SFTxmw4mZJ_eS1a9fHG-c6FvQcRHXfW71xenYBYYc,10809
- crawlo/commands/startproject.py,sha256=1oTDgfdIQQBHa9P_1te0siQG4MNeWnAHv_2J7v4a2po,11305
+ crawlo/commands/run.py,sha256=Go9hAEUMuG3GphBgemG5S5W4MF39XOxp7-E06rX-pTU,11043
+ crawlo/commands/startproject.py,sha256=UYGelGY4dM6Zu3U4G5m8snKqbsfgszhvfpAJLl5b5tM,15772
  crawlo/commands/stats.py,sha256=iEKdxHoqsJuTkn8zAF9ekBVO1--8__BeD7xohYG5NwE,6252
  crawlo/commands/utils.py,sha256=b7yW6UlOLprR3gN9oOdhcl3fsCwWRE3-_gDxWz5xhMo,5292
  crawlo/core/__init__.py,sha256=JYSAn15r8yWgRK_Nc69t_8tZCyb70MiPZKssA8wrYz0,43
- crawlo/core/engine.py,sha256=wfuiGJJbEOlSjtTC3yrcugSFnvWQBVhk9A7ynWap-0o,13490
+ crawlo/core/engine.py,sha256=B7j88EjqFSy3D1UJENIOKi-j-gWNSxxaOBMC3tQBzK0,13484
  crawlo/core/processor.py,sha256=oHLs-cno0bJGTNc9NGD2S7_2-grI3ruvggO0SY2mf3Q,1180
  crawlo/core/scheduler.py,sha256=CdHeVNJbCzRkivGMCiLpVHttJSEbIz5P6qywXAR_cw4,5089
  crawlo/downloader/__init__.py,sha256=8-r4_Wc_X64FJtKzNQamwsZsc428creKeFo19VxF33o,8565
@@ -43,7 +44,7 @@ crawlo/extension/memory_monitor.py,sha256=fClPchpCkVjcIiU0AJHCKDd7HEiz5B4KqNqKTR
  crawlo/extension/performance_profiler.py,sha256=BjWD3LOb4VwjQJQvQtWNg7GluEwFquI1CztNfgMzy3c,5032
  crawlo/extension/request_recorder.py,sha256=KA_RmcfscDxP5wPdolO76yKfRj-1jmHhG3jkVGO1pbc,4181
  crawlo/filters/__init__.py,sha256=lX-QOCDTiTRFoiK1qrZ5HABo7LgZfcxScx_lELYEvJk,4395
- crawlo/filters/aioredis_filter.py,sha256=UVT2ezSnKfYFGn9L0ia512JBoGwoI8djGZ5DDvBT3P8,10173
+ crawlo/filters/aioredis_filter.py,sha256=oveE9qFRtK8C9FZGRSZusRakXcAF-B2vkwbpsc0fX1E,10308
  crawlo/filters/memory_filter.py,sha256=FzGJPhVKfZ8P23kP6de-VSfE8oVMjjpfWzKJIdiMtZU,9529
  crawlo/items/__init__.py,sha256=rFpx1qFBo0Ik7bSdnXC8EVTJUOQdoJYGVdhYjaH00nk,409
  crawlo/items/base.py,sha256=hwGJEdFWOdaZfalFX8umRkh_HUWLEbCjvq4j70fplMQ,598
@@ -70,23 +71,23 @@ crawlo/pipelines/json_pipeline.py,sha256=wrCsh8YInmcPLAkhPrHObMx89VZfhf-c7qRrYsT
  crawlo/pipelines/memory_dedup_pipeline.py,sha256=oQcBODO-I2p6B7Nm_klXvuhzSMIHP-JWwC4_o6Gkgcc,3954
  crawlo/pipelines/mongo_pipeline.py,sha256=PohTKTGw3QRvuP-T6SrquwW3FAHSno8jQ2D2cH_d75U,5837
  crawlo/pipelines/mysql_pipeline.py,sha256=RRKMuO-7BTomyRFYCmDpfvjBTcU4SGdjGDV4wBeKWck,13796
- crawlo/pipelines/pipeline_manager.py,sha256=Kw37RC2GESWDnDJ6qIN1MA0qc27Uyhu77ebm1r-FgeU,2168
- crawlo/pipelines/redis_dedup_pipeline.py,sha256=iniHsb5KGpxkshSFXY9eOKRK8eqVfO62Hzz6kFnPdDQ,6342
+ crawlo/pipelines/pipeline_manager.py,sha256=g7lQBllYpKLymDWShjsR4SsXG41XnzD5mUR0UDBt0ZQ,2419
+ crawlo/pipelines/redis_dedup_pipeline.py,sha256=U2mms0XBkTgC8_DywIigGboChjUH_l0-cfo_ioYQNsI,6333
  crawlo/queue/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  crawlo/queue/pqueue.py,sha256=qTFOuvEXsYEZbm0ULjsOeZo0XtSsZ-SHpx7nFEtmluE,1095
- crawlo/queue/queue_manager.py,sha256=CIYlzOaaN855jP33NHieQ-eciL5vVjYC_bzCUX1GtIY,12501
- crawlo/queue/redis_priority_queue.py,sha256=7jB0atlml-g5i0HVyt7_6k-SFYc5ZimcSZgKUKk9EB0,11623
+ crawlo/queue/queue_manager.py,sha256=Rz7QI5KfDesmIHzXPSDsksm6ZcoHiyIQYSLkxi2B6Zg,13264
+ crawlo/queue/redis_priority_queue.py,sha256=oVvGjkcMpRqJEEBbRlAL16TKcp38Gg9YMdZCUQa7wFI,13081
  crawlo/settings/__init__.py,sha256=NgYFLfk_Bw7h6KSoepJn_lMBSqVbCHebjKxaE3_eMgw,130
- crawlo/settings/default_settings.py,sha256=qzOOVXycjSpUMvx861vE_vD4dZu22psTIBSzYuWqqRo,8908
- crawlo/settings/setting_manager.py,sha256=4xXOzKwZCgAp8ybwvVcs2R--CsOD7c6dBIkj6DJHB3c,2998
+ crawlo/settings/default_settings.py,sha256=vJ686U69xBkFVYp-I213jB9Z2wQ-fe30Kstm6qQ7aUc,9000
+ crawlo/settings/setting_manager.py,sha256=uqgm2uZQp7Wx1hMSf2FnA0dh6f4jzZ13NQbRvIfYwsQ,3829
  crawlo/spider/__init__.py,sha256=xAH6NfE_6K2aY_VSL9DoGjcmMHJDd5Nxr7TG1Y8vQAE,21091
  crawlo/templates/crawlo.cfg.tmpl,sha256=lwiUVe5sFixJgHFEjn1OtbAeyWsECOrz37uheuVtulk,240
  crawlo/templates/project/__init__.py.tmpl,sha256=aQnHaOjMSkTviOC8COUX0fKymuyf8lx2tGduxkMkXEE,61
  crawlo/templates/project/items.py.tmpl,sha256=8_3DBA8HrS2XbfHzsMZNJiZbFY6fDJUUMFoFti_obJk,314
- crawlo/templates/project/middlewares.py.tmpl,sha256=BMTAAFhmZQBn0D3lgWwepcOlXgtzt2lXYEG0XhXFTDI,3885
- crawlo/templates/project/pipelines.py.tmpl,sha256=ZIgQjtwMtdP7vyeeg-PSlOnKI46A9d6ew8bGAMx3Dxc,2717
- crawlo/templates/project/run.py.tmpl,sha256=IBD0F0XEgBR6gR34PhYAjKiZDdvLufZJkABHapTsoYo,8428
- crawlo/templates/project/settings.py.tmpl,sha256=XplJLfRgDdsN_tALmYM_wsDqA8tPd0D1j_UYzHCSxuA,11991
+ crawlo/templates/project/middlewares.py.tmpl,sha256=Em7KdWxF3FE5OzXwYxRkQtWr74YvapqhrI8Kij7J6dc,3840
+ crawlo/templates/project/pipelines.py.tmpl,sha256=j9oqEhCezmmHlBhMWgYtlgup4jhWnMlv6AEiAOHODkg,2704
+ crawlo/templates/project/run.py.tmpl,sha256=yOxpPkyffXLwa7NDx2Y96c8U-QN81_3mqZSuB526DNs,1271
+ crawlo/templates/project/settings.py.tmpl,sha256=M2tga44ki-BpgdKY506mcJ-iEw4AMx4pWQKUdJyRAkM,12034
  crawlo/templates/project/settings_distributed.py.tmpl,sha256=Nli-qR-UB4TXAq4mXjx17y-yAv46NwwpcVidjGeM00A,4321
  crawlo/templates/project/settings_gentle.py.tmpl,sha256=ljXf9vKV2c-cN8yvf5U4UWsPWrMVmJUfbXladIdS2mg,3320
  crawlo/templates/project/settings_high_performance.py.tmpl,sha256=3wf8fFYZ5EVE2742JlcwwrPF794vEIEmbxFSbqyGnJQ,5434
@@ -124,6 +125,7 @@ crawlo/utils/system.py,sha256=24zGmtHNhDFMGVo7ftMV-Pqg6_5d63zsyNey9udvJJk,248
  crawlo/utils/tools.py,sha256=uy7qw5Z1BIhyEgiHENvtM7WoGCJxlS8EX3PmOA7ouCo,275
  crawlo/utils/url.py,sha256=RKe_iqdjafsNcp-P2GVLYpsL1qbxiuZLiFc-SqOQkcs,1521
  examples/__init__.py,sha256=NkRbV8_S1tb8S2AW6BE2U6P2-eGOPwMR1k0YQAwQpSE,130
+ tests/DOUBLE_CRAWLO_PREFIX_FIX_REPORT.md,sha256=4W6HlT9Uc3cyu77T9pfbkrMxpAZ-xq_L9MU-GbukLV0,3427
  tests/__init__.py,sha256=409aRX8hsPffiZCVjOogtxwhACzBp8G2UTJyUQSxhK0,136
  tests/advanced_tools_example.py,sha256=1_iitECKCuWUYMNNGo61l3lmwMRrWdA8F_Xw56UaGZY,9340
  tests/authenticated_proxy_example.py,sha256=b307_RybOtxXVQK0ITboLvHmOTwIN4yTF9aup4dYF7Q,8477
@@ -144,6 +146,8 @@ tests/test_cleaners.py,sha256=qyd20RNuBHIVHz7X5JjLwlIZedn7yHZ4uB3X78BpaF4,1819
  tests/test_comprehensive.py,sha256=dvRJeeVYc1cgXK9Y171hH9Y847zZpWSAFFH-EI3UepQ,5182
  tests/test_config_validator.py,sha256=Ec1h8Mw-fVz1d9JoATIWZb0nTc8pYjhTCSjPm3tvkTQ,6825
  tests/test_date_tools.py,sha256=pcLDyhLrZ_jh-PhPm4CvLZEgNeH9kLMPKN5zacHwuWM,4053
+ tests/test_double_crawlo_fix.py,sha256=qkJugMiEM2KQFHZ_Iza7Y0cER8p-Mecr57zjwHdsaSE,8033
+ tests/test_double_crawlo_fix_simple.py,sha256=aPQQRex7ShxC_QJyjRkgelbn0Lnl3fFTwsPH5OglpwM,4807
  tests/test_dynamic_downloaders_proxy.py,sha256=t_aWpxOHi4h3_fg2ImtIq7IIJ0r3PTHtnXiopPe2ZlM,4450
  tests/test_dynamic_proxy.py,sha256=zi7Ocbhc9GL1zCs0XhmG2NvBBeIZ2d2hPJVh18lH4Y0,3172
  tests/test_dynamic_proxy_config.py,sha256=C_9CEjCJtrr0SxIXCyLDhSIi88ujF7UAT1F-FAphd0w,5853
@@ -163,6 +167,7 @@ tests/test_proxy_middleware_integration.py,sha256=mTPK_XvbmLCV_QoVZzA3ybWOOX6149
  tests/test_proxy_providers.py,sha256=u_R2fhab90vqvQEaOAztpAOe9tJXvUMIdoDxmStmXJ4,1749
  tests/test_proxy_stats.py,sha256=ES00CEoDITYPFBGPk8pecFzD3ItYIv6NSpcqNd8-kvo,526
  tests/test_proxy_strategies.py,sha256=9Z1pXmTNyw-eIhGXlf2abZbJx6igLohYq-_3hldQ5uE,1868
+ tests/test_queue_manager_double_crawlo.py,sha256=ddnj2BafKj6RZjkK4yAhilyuY__xan7fJYfBSBn2W1M,9346
  tests/test_queue_manager_redis_key.py,sha256=tvy3qkmB6XNpnJ4SOgjKvxE83hltCdL5Z32CupQ2VZ0,6454
  tests/test_redis_config.py,sha256=Kbl3PURGNM1BUIspakEOA-ZOl2xxTHb_8KbftwjYOsg,921
  tests/test_redis_connection_pool.py,sha256=wLPmJ94jUajpShNrnnl4pTbX9ZIGqCEZgzzecasAC4s,9471
@@ -178,8 +183,8 @@ tests/test_template_content.py,sha256=URwjlAzMCdUN0sW_OupUcuSNMxp1OKgW79JOpkLPXn
  tests/test_template_redis_key.py,sha256=dOFutic8CL3tOzGbYhWbMrYiXZ8R3fhNoF5VKax5Iy0,4946
  tests/test_tools.py,sha256=fgzXL2L7eBV_nGjeMxH8IMhfc0dviQ80XgzZkJp_4dA,5266
  tests/tools_example.py,sha256=uXNS4xXJ-OD_xInAn2zjKLG_nlbgVGXZLoJtfhaG9lI,7926
- crawlo-1.1.5.dist-info/METADATA,sha256=24KOm7Z64Y-moon13p7n7jV4XBelSFZEfwOs0LtCBDI,20068
- crawlo-1.1.5.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
- crawlo-1.1.5.dist-info/entry_points.txt,sha256=5HoVoTSPxI8SCa5B7pQYxLSrkOdiunyO9tqNsLMv52g,43
- crawlo-1.1.5.dist-info/top_level.txt,sha256=keG_67pbZ_wZL2dmDRA9RMaNHTaV_x_oxZ9DKNgwvR0,22
- crawlo-1.1.5.dist-info/RECORD,,
+ crawlo-1.1.8.dist-info/METADATA,sha256=nszYOx5Hd5KbjBikYameYCyotd_pDcGLMSsGSxV_YpU,26174
+ crawlo-1.1.8.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+ crawlo-1.1.8.dist-info/entry_points.txt,sha256=5HoVoTSPxI8SCa5B7pQYxLSrkOdiunyO9tqNsLMv52g,43
+ crawlo-1.1.8.dist-info/top_level.txt,sha256=keG_67pbZ_wZL2dmDRA9RMaNHTaV_x_oxZ9DKNgwvR0,22
+ crawlo-1.1.8.dist-info/RECORD,,
@@ -0,0 +1,82 @@
+ # Fix Report: Doubled `crawlo` Prefix
+
+ ## Problem description
+ When running distributed crawls, users found Redis keys with a doubled `crawlo` prefix, for example `crawlo:crawlo:queue:processing:data`. This led to inconsistent Redis key naming and potential confusion.
+
+ ## Analysis
+ Code analysis traced the problem to two places:
+ 1. The RedisPriorityQueue class rewrote the user-supplied queue name while processing it.
+ 2. The QueueManager class did not correctly handle a doubled `crawlo` prefix when extracting the project name.
+
+ ## Fix
+
+ ### 1. RedisPriorityQueue fix
+ File: `crawlo/queue/redis_priority_queue.py`
+
+ **Before**:
+ ```python
+ # If queue_name is provided, make sure it follows the naming convention
+ # and handle a possible duplicated prefix
+ if queue_name.startswith("crawlo:crawlo:"):
+     # Fix the doubled crawlo prefix
+     self.queue_name = queue_name.replace("crawlo:crawlo:", "crawlo:", 1)
+ elif not queue_name.startswith("crawlo:"):
+     # No crawlo prefix yet, so add one
+     self.queue_name = f"crawlo:{module_name}:queue:requests"
+ else:
+     # Already has the correct crawlo prefix
+     self.queue_name = queue_name
+ ```
+
+ **After**:
+ ```python
+ # Keep the user-supplied queue name unchanged
+ self.queue_name = queue_name
+ ```
+
+ ### 2. QueueManager fix
+ File: `crawlo/queue/queue_manager.py`
+
+ **After**:
+ ```python
+ # Handle a possible doubled crawlo prefix
+ if parts[0] == "crawlo" and parts[1] == "crawlo":
+     # Doubled crawlo prefix: use the third segment as the project name
+     if len(parts) >= 3:
+         project_name = parts[2]
+     else:
+         project_name = "default"
+ elif parts[0] == "crawlo":
+     # Normal crawlo prefix: use the second segment as the project name
+     project_name = parts[1]
+ else:
+     # No crawlo prefix: use the first segment as the project name
+     project_name = parts[0]
+ ```
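For reference, the extraction rule above can be read as the standalone function below. This is an illustrative rewrite by the editor, not the code shipped in `queue_manager.py`; the guards for very short names are additional assumptions. The asserted results match the test cases listed in the next section.

```python
def extract_project_name(queue_name: str) -> str:
    """Derive the project name from a Redis queue key such as
    'crawlo:<project>:queue:requests', tolerating a doubled 'crawlo' prefix."""
    parts = queue_name.split(":")
    if len(parts) >= 2 and parts[0] == "crawlo" and parts[1] == "crawlo":
        # Doubled prefix: the third segment (if present) is taken as the project name.
        return parts[2] if len(parts) >= 3 else "default"
    if parts[0] == "crawlo":
        # Normal prefix: the second segment is the project name.
        return parts[1] if len(parts) >= 2 else "default"
    # No 'crawlo' prefix at all: fall back to the first segment.
    return parts[0]


assert extract_project_name("crawlo:test_project:queue:requests") == "test_project"
assert extract_project_name("crawlo:crawlo:queue:requests") == "queue"
assert extract_project_name("crawlo:crawlo:crawlo:queue:requests") == "crawlo"
```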
+
+ ## Test verification
+
+ ### Test 1: Redis queue naming fix
+ Verifies that RedisPriorityQueue handles the various queue name formats correctly:
+ - Normal name: `crawlo:test_project:queue:requests` → `crawlo:test_project:queue:requests`
+ - Doubled crawlo prefix: `crawlo:crawlo:queue:requests` → `crawlo:crawlo:queue:requests`
+ - Tripled crawlo prefix: `crawlo:crawlo:crawlo:queue:requests` → `crawlo:crawlo:crawlo:queue:requests`
+
+ ### Test 2: Project name extraction in the queue manager
+ Verifies that QueueManager extracts the project name correctly:
+ - Normal name: `crawlo:test_project:queue:requests` → `test_project`
+ - Doubled crawlo prefix: `crawlo:crawlo:queue:requests` → `queue`
+ - Tripled crawlo prefix: `crawlo:crawlo:crawlo:queue:requests` → `crawlo`
+
+ ### Test 3: Queue creation via the queue manager
+ Verifies the end-to-end flow and that the queue name stays consistent while being passed around.
+
+ All tests pass, showing that the doubled `crawlo` prefix problem is resolved.
+
+ ## Conclusion
+ With the fixes above, the doubled `crawlo` prefix no longer appears in Redis keys. Redis queue names now stay exactly as the user configured them, and the processing and failed queues keep the same prefix structure.
+
+ ## Recommendations
+ 1. Use the standard queue name format in project configuration, e.g. `crawlo:{project_name}:queue:requests`.
+ 2. Use a Redis key validation tool to periodically check and normalize Redis key naming.
+ 3. If a uniform naming convention is required, specify the queue name explicitly when the project is initialized.
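The periodic check suggested in recommendation 2 can be as simple as scanning keys and matching them against the standard layout from recommendation 1. The sketch below is an editor-added illustration using the redis-py client; the regex and key pattern are assumptions about what "standard" means and should be adapted to the project's actual naming.

```python
import re

import redis  # redis-py client

# Standard layout from recommendation 1, extended with optional extra segments
# (e.g. ...:queue:processing:data); a second "crawlo:" right after the first is rejected.
STANDARD_KEY = re.compile(r"^crawlo:(?!crawlo:)[A-Za-z_]\w*:queue(?::[\w.-]+)+$")


def find_nonconforming_keys(client: redis.Redis, pattern: str = "crawlo:*"):
    """Return all keys matching `pattern` that do not follow the standard naming."""
    # SCAN keeps the check cheap on large instances (unlike KEYS).
    return [key for key in client.scan_iter(match=pattern) if not STANDARD_KEY.match(key)]


if __name__ == "__main__":
    r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
    for key in find_nonconforming_keys(r):
        print("non-standard key:", key)  # e.g. crawlo:crawlo:queue:processing:data
```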