mindforge-cc 11.5.1 → 11.7.0
- package/.agent/mindforge/skill-tdd.md +53 -0
- package/.agent/mindforge/skills-index.md +118 -0
- package/.agent/mindforge/systematic-debug.md +60 -0
- package/.agent/mindforge/wf-catalog.md +37 -0
- package/.agent/mindforge/wf-code-audit.md +31 -0
- package/.agent/mindforge/wf-competitive-analysis.md +31 -0
- package/.agent/mindforge/wf-deep-research.md +32 -0
- package/.agent/mindforge/wf-feature-planner.md +31 -0
- package/.agent/mindforge/wf-incident-response.md +31 -0
- package/.agent/mindforge/wf-onboard-codebase.md +31 -0
- package/.agent/mindforge/wf-perf-optimize.md +31 -0
- package/.agent/mindforge/wf-pr-review.md +31 -0
- package/.agent/mindforge/wf-refactor-plan.md +31 -0
- package/.agent/mindforge/wf-release-prep.md +31 -0
- package/.agent/mindforge/wf-tdd-sprint.md +31 -0
- package/.agent/mindforge/wf-tech-evaluation.md +31 -0
- package/.agent/skills/1password-skill/SKILL.md +156 -0
- package/.agent/skills/1password-skill/references/cli-examples.md +31 -0
- package/.agent/skills/1password-skill/references/get-started.md +21 -0
- package/.agent/skills/article-illustrator/SKILL.md +199 -0
- package/.agent/skills/article-illustrator/references/prompt-construction.md +426 -0
- package/.agent/skills/article-illustrator/references/style-presets.md +80 -0
- package/.agent/skills/article-illustrator/references/styles.md +224 -0
- package/.agent/skills/article-illustrator/references/usage.md +50 -0
- package/.agent/skills/article-illustrator/references/workflow.md +332 -0
- package/.agent/skills/arxiv/SKILL.md +275 -0
- package/.agent/skills/blogwatcher/SKILL.md +130 -0
- package/.agent/skills/code-wiki/SKILL.md +438 -0
- package/.agent/skills/code-wiki/templates/README.md +31 -0
- package/.agent/skills/code-wiki/templates/architecture.md +30 -0
- package/.agent/skills/code-wiki/templates/getting-started.md +47 -0
- package/.agent/skills/code-wiki/templates/module.md +38 -0
- package/.agent/skills/codebase-inspection/SKILL.md +109 -0
- package/.agent/skills/comic-creator/SKILL.md +240 -0
- package/.agent/skills/comic-creator/references/analysis-framework.md +176 -0
- package/.agent/skills/comic-creator/references/auto-selection.md +71 -0
- package/.agent/skills/comic-creator/references/base-prompt.md +98 -0
- package/.agent/skills/comic-creator/references/character-template.md +180 -0
- package/.agent/skills/comic-creator/references/ohmsha-guide.md +85 -0
- package/.agent/skills/comic-creator/references/partial-workflows.md +106 -0
- package/.agent/skills/comic-creator/references/storyboard-template.md +143 -0
- package/.agent/skills/comic-creator/references/workflow.md +401 -0
- package/.agent/skills/concept-diagrams/SKILL.md +355 -0
- package/.agent/skills/concept-diagrams/references/dashboard-patterns.md +43 -0
- package/.agent/skills/concept-diagrams/references/infrastructure-patterns.md +144 -0
- package/.agent/skills/concept-diagrams/references/physical-shape-cookbook.md +42 -0
- package/.agent/skills/creative-ideation/SKILL.md +144 -0
- package/.agent/skills/creative-ideation/references/full-prompt-library.md +110 -0
- package/.agent/skills/devops-cli/SKILL.md +149 -0
- package/.agent/skills/devops-cli/references/app-discovery.md +112 -0
- package/.agent/skills/devops-cli/references/authentication.md +59 -0
- package/.agent/skills/devops-cli/references/cli-reference.md +104 -0
- package/.agent/skills/devops-cli/references/running-apps.md +171 -0
- package/.agent/skills/devops-watchers/SKILL.md +103 -0
- package/.agent/skills/docker-management/SKILL.md +273 -0
- package/.agent/skills/domain-intel/SKILL.md +96 -0
- package/.agent/skills/duckduckgo-search/SKILL.md +230 -0
- package/.agent/skills/github-auth/SKILL.md +240 -0
- package/.agent/skills/github-code-review/SKILL.md +474 -0
- package/.agent/skills/github-code-review/references/review-output-template.md +74 -0
- package/.agent/skills/github-issues/SKILL.md +363 -0
- package/.agent/skills/github-issues/templates/bug-report.md +35 -0
- package/.agent/skills/github-issues/templates/feature-request.md +31 -0
- package/.agent/skills/github-pr-workflow/SKILL.md +360 -0
- package/.agent/skills/github-pr-workflow/references/ci-troubleshooting.md +183 -0
- package/.agent/skills/github-pr-workflow/references/conventional-commits.md +71 -0
- package/.agent/skills/github-pr-workflow/templates/pr-body-bugfix.md +35 -0
- package/.agent/skills/github-pr-workflow/templates/pr-body-feature.md +33 -0
- package/.agent/skills/github-repo-management/SKILL.md +509 -0
- package/.agent/skills/github-repo-management/references/github-api-cheatsheet.md +161 -0
- package/.agent/skills/godmode/SKILL.md +396 -0
- package/.agent/skills/godmode/references/jailbreak-templates.md +128 -0
- package/.agent/skills/godmode/references/refusal-detection.md +142 -0
- package/.agent/skills/hyperframes/SKILL.md +182 -0
- package/.agent/skills/hyperframes/references/cli.md +185 -0
- package/.agent/skills/hyperframes/references/composition.md +129 -0
- package/.agent/skills/hyperframes/references/features.md +289 -0
- package/.agent/skills/hyperframes/references/gsap.md +136 -0
- package/.agent/skills/hyperframes/references/troubleshooting.md +137 -0
- package/.agent/skills/hyperframes/references/website-to-video.md +145 -0
- package/.agent/skills/jupyter-live-kernel/SKILL.md +160 -0
- package/.agent/skills/kanban-orchestrator/SKILL.md +209 -0
- package/.agent/skills/kanban-worker/SKILL.md +188 -0
- package/.agent/skills/llm-wiki/SKILL.md +499 -0
- package/.agent/skills/meme-generation/SKILL.md +122 -0
- package/.agent/skills/node-inspect-debugger/SKILL.md +312 -0
- package/.agent/skills/obsidian/SKILL.md +60 -0
- package/.agent/skills/osint-investigation/SKILL.md +269 -0
- package/.agent/skills/osint-investigation/templates/source-template.md +59 -0
- package/.agent/skills/oss-forensics/SKILL.md +422 -0
- package/.agent/skills/oss-forensics/references/evidence-types.md +89 -0
- package/.agent/skills/oss-forensics/references/github-archive-guide.md +184 -0
- package/.agent/skills/oss-forensics/references/investigation-templates.md +131 -0
- package/.agent/skills/oss-forensics/references/recovery-techniques.md +164 -0
- package/.agent/skills/oss-forensics/templates/forensic-report.md +151 -0
- package/.agent/skills/oss-forensics/templates/malicious-package-report.md +43 -0
- package/.agent/skills/parallel-cli/SKILL.md +384 -0
- package/.agent/skills/pinggy-tunnel/SKILL.md +302 -0
- package/.agent/skills/pixel-art/SKILL.md +209 -0
- package/.agent/skills/pixel-art/references/palettes.md +49 -0
- package/.agent/skills/plan/SKILL.md +331 -0
- package/.agent/skills/polymarket/SKILL.md +75 -0
- package/.agent/skills/polymarket/references/api-endpoints.md +220 -0
- package/.agent/skills/python-debugpy/SKILL.md +368 -0
- package/.agent/skills/requesting-code-review/SKILL.md +273 -0
- package/.agent/skills/research-paper-writing/SKILL.md +2367 -0
- package/.agent/skills/research-paper-writing/references/autoreason-methodology.md +394 -0
- package/.agent/skills/research-paper-writing/references/checklists.md +434 -0
- package/.agent/skills/research-paper-writing/references/citation-workflow.md +563 -0
- package/.agent/skills/research-paper-writing/references/experiment-patterns.md +728 -0
- package/.agent/skills/research-paper-writing/references/human-evaluation.md +476 -0
- package/.agent/skills/research-paper-writing/references/paper-types.md +481 -0
- package/.agent/skills/research-paper-writing/references/reviewer-guidelines.md +433 -0
- package/.agent/skills/research-paper-writing/references/sources.md +191 -0
- package/.agent/skills/research-paper-writing/references/writing-guide.md +474 -0
- package/.agent/skills/research-paper-writing/templates/README.md +251 -0
- package/.agent/skills/rest-graphql-debug/SKILL.md +507 -0
- package/.agent/skills/s6-container-supervision/SKILL.md +171 -0
- package/.agent/skills/scrapling/SKILL.md +328 -0
- package/.agent/skills/sherlock/SKILL.md +186 -0
- package/.agent/skills/simplify-code/SKILL.md +168 -0
- package/.agent/skills/skill-authoring/SKILL.md +158 -0
- package/.agent/skills/spike/SKILL.md +190 -0
- package/.agent/skills/subagent-driven-development/SKILL.md +345 -0
- package/.agent/skills/subagent-driven-development/references/context-budget-discipline.md +53 -0
- package/.agent/skills/subagent-driven-development/references/gates-taxonomy.md +93 -0
- package/.agent/skills/systematic-debugging/SKILL.md +360 -0
- package/.agent/skills/test-driven-development/SKILL.md +336 -0
- package/.agent/skills/video-orchestrator/SKILL.md +194 -0
- package/.agent/skills/video-orchestrator/references/examples.md +227 -0
- package/.agent/skills/video-orchestrator/references/intake.md +166 -0
- package/.agent/skills/video-orchestrator/references/kanban-setup.md +278 -0
- package/.agent/skills/video-orchestrator/references/monitoring.md +180 -0
- package/.agent/skills/video-orchestrator/references/role-archetypes.md +298 -0
- package/.agent/skills/video-orchestrator/references/tool-matrix.md +317 -0
- package/.agent/skills/web-pentest/SKILL.md +332 -0
- package/.agent/skills/web-pentest/references/bypass-techniques.md +133 -0
- package/.agent/skills/web-pentest/references/exploitation-techniques.md +204 -0
- package/.agent/skills/web-pentest/references/scope-enforcement.md +110 -0
- package/.agent/skills/web-pentest/references/vuln-taxonomy.md +81 -0
- package/.agent/skills/web-pentest/templates/authorization.md +69 -0
- package/.agent/skills/web-pentest/templates/pentest-report.md +178 -0
- package/.claude/commands/mindforge/skill-tdd.md +53 -0
- package/.claude/commands/mindforge/skills-index.md +118 -0
- package/.claude/commands/mindforge/systematic-debug.md +60 -0
- package/.claude/commands/mindforge/wf-catalog.md +37 -0
- package/.claude/commands/mindforge/wf-code-audit.md +31 -0
- package/.claude/commands/mindforge/wf-competitive-analysis.md +31 -0
- package/.claude/commands/mindforge/wf-deep-research.md +32 -0
- package/.claude/commands/mindforge/wf-feature-planner.md +31 -0
- package/.claude/commands/mindforge/wf-incident-response.md +31 -0
- package/.claude/commands/mindforge/wf-onboard-codebase.md +31 -0
- package/.claude/commands/mindforge/wf-perf-optimize.md +31 -0
- package/.claude/commands/mindforge/wf-pr-review.md +31 -0
- package/.claude/commands/mindforge/wf-refactor-plan.md +31 -0
- package/.claude/commands/mindforge/wf-release-prep.md +31 -0
- package/.claude/commands/mindforge/wf-tdd-sprint.md +31 -0
- package/.claude/commands/mindforge/wf-tech-evaluation.md +31 -0
- package/.mindforge/config.json +2 -2
- package/.mindforge/dynamic-workflows/REGISTRY.md +65 -0
- package/.mindforge/dynamic-workflows/index.json +171 -0
- package/.mindforge/dynamic-workflows/scripts/code-audit.js +103 -0
- package/.mindforge/dynamic-workflows/scripts/competitive-analysis.js +85 -0
- package/.mindforge/dynamic-workflows/scripts/deep-research.js +151 -0
- package/.mindforge/dynamic-workflows/scripts/feature-planner.js +104 -0
- package/.mindforge/dynamic-workflows/scripts/incident-response.js +106 -0
- package/.mindforge/dynamic-workflows/scripts/onboard-codebase.js +102 -0
- package/.mindforge/dynamic-workflows/scripts/perf-optimize.js +128 -0
- package/.mindforge/dynamic-workflows/scripts/pr-review.js +87 -0
- package/.mindforge/dynamic-workflows/scripts/refactor-plan.js +121 -0
- package/.mindforge/dynamic-workflows/scripts/release-prep.js +102 -0
- package/.mindforge/dynamic-workflows/scripts/tdd-sprint.js +103 -0
- package/.mindforge/dynamic-workflows/scripts/tech-evaluation.js +72 -0
- package/.mindforge/memory/sync-manifest.json +1 -1
- package/.mindforge/skills/arxiv/SKILL.md +294 -0
- package/.mindforge/skills/blogwatcher/SKILL.md +147 -0
- package/.mindforge/skills/code-wiki/SKILL.md +457 -0
- package/.mindforge/skills/codebase-inspection/SKILL.md +126 -0
- package/.mindforge/skills/concept-diagrams/SKILL.md +373 -0
- package/.mindforge/skills/creative-ideation/SKILL.md +162 -0
- package/.mindforge/skills/domain-intel/SKILL.md +116 -0
- package/.mindforge/skills/duckduckgo-search/SKILL.md +249 -0
- package/.mindforge/skills/github-code-review/SKILL.md +493 -0
- package/.mindforge/skills/github-issues/SKILL.md +382 -0
- package/.mindforge/skills/github-pr-workflow/SKILL.md +379 -0
- package/.mindforge/skills/jupyter-live-kernel/SKILL.md +179 -0
- package/.mindforge/skills/kanban-orchestrator/SKILL.md +227 -0
- package/.mindforge/skills/kanban-worker/SKILL.md +206 -0
- package/.mindforge/skills/meme-generation/SKILL.md +141 -0
- package/.mindforge/skills/obsidian/SKILL.md +80 -0
- package/.mindforge/skills/osint-investigation/SKILL.md +288 -0
- package/.mindforge/skills/oss-forensics/SKILL.md +421 -0
- package/.mindforge/skills/pixel-art/SKILL.md +228 -0
- package/.mindforge/skills/plan/SKILL.md +350 -0
- package/.mindforge/skills/requesting-code-review/SKILL.md +292 -0
- package/.mindforge/skills/research-paper-writing/SKILL.md +2384 -0
- package/.mindforge/skills/scrapling/SKILL.md +345 -0
- package/.mindforge/skills/sherlock/SKILL.md +203 -0
- package/.mindforge/skills/simplify-code/SKILL.md +187 -0
- package/.mindforge/skills/spike/SKILL.md +209 -0
- package/.mindforge/skills/subagent-driven-development/SKILL.md +364 -0
- package/.mindforge/skills/systematic-debugging/SKILL.md +379 -0
- package/.mindforge/skills/test-driven-development/SKILL.md +355 -0
- package/.mindforge/skills/web-pentest/SKILL.md +327 -0
- package/CHANGELOG.md +71 -0
- package/MINDFORGE.md +2 -2
- package/README.md +72 -3
- package/RELEASENOTES.md +109 -0
- package/bin/installer-core.js +6 -2
- package/bin/mindforge-cli.js +7 -0
- package/bin/workflows/workflow-runner.js +110 -0
- package/docs/commands-reference.md +25 -0
- package/docs/getting-started.md +42 -5
- package/package.json +2 -1
@@ -0,0 +1,345 @@
---
name: scrapling
description: "Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python."
version: 1.0.0
status: stable
min_mindforge_version: 11.5.1
triggers: scrape website, web scraping, extract web content, scrape page, web page scraping, extract from website, html scraping, scrape data, web extraction, crawl page, scrape url, web content extraction
---

# Scrapling

[Scrapling](https://github.com/D4Vinci/Scrapling) is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.

**This skill is for educational and research purposes only.** Users must comply with local/international data scraping laws and respect website Terms of Service.

## When to Use

- Scraping static HTML pages (faster than browser tools)
- Scraping JS-rendered pages that need a real browser
- Bypassing Cloudflare Turnstile or bot detection
- Crawling multiple pages with a spider
- When the built-in `web_extract` tool does not return the data you need

## Installation

```bash
pip install "scrapling[all]"
scrapling install
```

Minimal install (HTTP only, no browser):
```bash
pip install scrapling
```

With browser automation only:
```bash
pip install "scrapling[fetchers]"
scrapling install
```

## Quick Reference

| Approach | Class | Use When |
|----------|-------|----------|
| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests |
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs |
| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites |
| Spider | `Spider` | Multi-page crawling with link following |

## CLI Usage

### Extract Static Page

```bash
scrapling extract get 'https://example.com' output.md
```

With CSS selector and browser impersonation:

```bash
scrapling extract get 'https://example.com' output.md \
  --css-selector '.content' \
  --impersonate 'chrome'
```

### Extract JS-Rendered Page

```bash
scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --disable-resources \
  --network-idle
```

### Extract Cloudflare-Protected Page

```bash
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare \
  --block-webrtc \
  --hide-canvas
```

### POST Request

```bash
scrapling extract post 'https://example.com/api' output.json \
  --json '{"query": "search term"}'
```

### Output Formats

The output format is determined by the file extension:
- `.html` -- raw HTML
- `.md` -- converted to Markdown
- `.txt` -- plain text
- `.json` / `.jsonl` -- JSON
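
The extension rule above can be sketched as a small lookup. This is an illustrative helper, not part of Scrapling: the name `output_format`, the case-insensitive matching, and the error raised for unknown extensions are all assumptions made for the example.

```python
from pathlib import Path

# Extension-to-format mapping copied from the list above.
FORMATS = {
    ".html": "raw HTML",
    ".md": "Markdown",
    ".txt": "plain text",
    ".json": "JSON",
    ".jsonl": "JSON",
}


def output_format(filename: str) -> str:
    """Return the format implied by the output file's extension (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()  # lower-casing is an assumption
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMATS[suffix]


print(output_format("output.md"))
```

Checking the extension before invoking the CLI makes it easier to catch a mistyped output name early.
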

## Python: HTTP Scraping

### Single Request

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
    print(q)
```

### Session (Persistent Cookies)

```python
from urllib.parse import urljoin

from scrapling.fetchers import FetcherSession

BASE_URL = 'https://example.com/'

with FetcherSession(impersonate='chrome') as session:
    page = session.get(BASE_URL, stealthy_headers=True)
    links = page.css('a::attr(href)').getall()
    for link in links[:5]:
        # hrefs may be relative, so resolve them against the page URL
        sub = session.get(urljoin(BASE_URL, link))
        print(sub.css('h1::text').get())
```

### POST / PUT / DELETE

```python
from scrapling.fetchers import Fetcher

page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')
```

### With Proxy

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
```

## Python: Dynamic Pages (JS-Rendered)

For pages that require JavaScript execution (SPAs, lazy-loaded content):

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()
```

### Wait for Specific Element

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector='.results',       # CSS selector to wait for
    wait_selector_state='visible',  # state the selector must reach
    network_idle=True,
)
```

### Disable Resources for Speed

Blocks fonts, images, media, stylesheets (~25% faster):

```python
from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
    page = session.fetch('https://example.com')
    items = page.css('.item::text').getall()
```

### Custom Page Automation

```python
from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher

def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)
    page.wait_for_timeout(1000)
    page.click('button.load-more')
    page.wait_for_selector('.extra-results')

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()
```

## Python: Stealth Mode (Anti-Bot Bypass)

For Cloudflare-protected or heavily fingerprinted sites:

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
content = page.css('.protected-content::text').getall()
```

### Stealth Session

```python
from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')
```

## Element Selection

All fetchers return a `Selector` object with these methods:

### CSS Selectors

```python
page.css('h1::text').get()                # First h1 text
page.css('a::attr(href)').getall()        # All link hrefs
page.css('.quote .text::text').getall()   # Nested selection
```

### XPath

```python
page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()
```

### Find Methods

```python
page.find_all('div', class_='quote')       # By tag + attribute
page.find_by_text('Read more', tag='a')    # By text content
page.find_by_regex(r'\$\d+\.\d{2}')        # By regex pattern
```
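
To see what the price pattern passed to `find_by_regex` above matches, it can be tried with Python's standard `re` module, independent of Scrapling. The sample string is made up for illustration.

```python
import re

# Same pattern as in the find_by_regex example: a dollar sign, one or more
# digits, a literal dot, and exactly two decimal digits.
PRICE = re.compile(r'\$\d+\.\d{2}')

sample = "Widget $19.99 (was $25.00), shipping $3"  # hypothetical text
matches = PRICE.findall(sample)
print(matches)  # ['$19.99', '$25.00']
```

The bare `$3` is skipped because the pattern requires two digits after the decimal point.
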

### Similar Elements

Find elements with similar structure (useful for product listings, etc.):

```python
first_product = page.css('.product')[0]
all_similar = first_product.find_similar()
```

### Navigation

```python
el = page.css('.target')[0]
el.parent          # Parent element
el.children        # Child elements
el.next_sibling    # Next sibling
el.prev_sibling    # Previous sibling
```

## Python: Spider Framework

For multi-page crawling with link following:

```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }

        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```

### Multi-Session Spider

Route requests to different fetcher types:

```python
from scrapling.fetchers import FetcherSession, AsyncStealthySession
from scrapling.spiders import Spider, Request, Response

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
```

### Pause/Resume Crawling

```python
# QuotesSpider is the class defined in the first spider example
spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # Ctrl+C to pause, re-run to resume from checkpoint
```

## Pitfalls

- **Browser install required**: run `scrapling install` after pip install -- without it, `DynamicFetcher` and `StealthyFetcher` will fail
- **Timeouts**: `DynamicFetcher`/`StealthyFetcher` timeout is in **milliseconds** (default 30000), `Fetcher` timeout is in **seconds**
- **Cloudflare bypass**: `solve_cloudflare=True` adds 5-15 seconds to fetch time -- only enable when needed
- **Resource usage**: `StealthyFetcher` runs a real browser -- limit concurrent usage
- **Legal**: always check robots.txt and website ToS before scraping. This library is for educational and research purposes
- **Python version**: requires Python 3.10+
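
Two of the pitfalls above, the timeout units and the robots.txt check, can be guarded with small helpers. This is a minimal sketch using only the standard library. The helper names `timeout_for` and `allowed_by_robots` are hypothetical, and the robots.txt rules are a made-up sample rather than any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Fetchers that take their timeout in milliseconds, per the pitfall above.
BROWSER_FETCHERS = {"DynamicFetcher", "StealthyFetcher"}


def timeout_for(fetcher_name: str, seconds: float) -> float:
    """Convert a timeout in seconds to the unit the named fetcher expects."""
    if fetcher_name in BROWSER_FETCHERS:
        return seconds * 1000  # browser-based fetchers: milliseconds
    return seconds  # Fetcher: seconds


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"  # hypothetical robots.txt
print(timeout_for("DynamicFetcher", 30))
print(allowed_by_robots(rules, "https://example.com/private/data"))
print(allowed_by_robots(rules, "https://example.com/public"))
```

Keeping all timeouts in seconds in calling code and converting at the boundary avoids passing `30` to a fetcher that reads it as 30 milliseconds.
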

## Mandatory actions when this skill is active

Before applying this skill:
- [ ] Read the task requirements fully before acting
- [ ] Confirm you understand the goal and constraints
- [ ] Check for existing work or prior context in the codebase

While working:
- [ ] Follow the methodology described above step by step
- [ ] Document any decisions or findings as you go

After completing:
- [ ] Self-check: does the output satisfy the original requirement?
- [ ] Verify no regressions or unintended side effects

@@ -0,0 +1,203 @@
---
name: sherlock
description: "OSINT username search across 400+ social networks. Hunt down social media accounts by username."
version: 1.0.0
status: stable
min_mindforge_version: 11.5.1
triggers: sherlock, username investigation, find accounts, OSINT username, social media investigation, find social accounts, username search, account discovery, username osint, find profiles, sherlock username, account investigation
---

# Sherlock OSINT Username Search

Hunt down social media accounts by username across 400+ social networks using the [Sherlock Project](https://github.com/sherlock-project/sherlock).

## When to Use

- User asks to find accounts associated with a username
- User wants to check username availability across platforms
- User is conducting OSINT or reconnaissance research
- User asks "where is this username registered?" or similar

## Requirements

- Sherlock CLI installed: `pipx install sherlock-project` or `pip install sherlock-project`
- Alternatively: Docker available (`docker run -it --rm sherlock/sherlock`)
- Network access to query social platforms

## Procedure

### 1. Check if Sherlock is Installed

**Before doing anything else**, verify sherlock is available:

```bash
sherlock --version
```

If the command fails:
- Offer to install: `pipx install sherlock-project` (recommended) or `pip install sherlock-project`
- **Do NOT** try multiple installation methods: pick one and proceed
- If installation fails, inform the user and stop

### 2. Extract Username

**Extract the username directly from the user's message if clearly stated.**

Examples where you should **NOT** use `clarify`:
- "Find accounts for nasa" → username is `nasa`
- "Search for johndoe123" → username is `johndoe123`
- "Check if alice exists on social media" → username is `alice`
- "Look up user bob on social networks" → username is `bob`

**Only use `clarify` if:**
- Multiple potential usernames are mentioned ("search for alice or bob")
- The phrasing is ambiguous ("search for my username" without specifying)
- No username is mentioned at all ("do an OSINT search")

When extracting, take the **exact** username as stated: preserve case, numbers, underscores, etc.

### 3. Build Command

**Default command** (use this unless the user specifically requests otherwise):
```bash
sherlock --print-found --no-color "<username>" --timeout 90
```

**Optional flags** (only add if the user explicitly requests them):
- `--nsfw`: include NSFW sites (only if the user asks)
- `--tor`: route through Tor (only if the user asks for anonymity)

**Do NOT ask about options via `clarify`.** Just run the default search. Users can request specific options if needed.
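
The default command above can also be assembled as an argument list, which avoids shell-quoting problems when the username contains unusual characters. This is a minimal sketch: the function name `build_sherlock_command` and the rejection of usernames that start with `-` are assumptions for illustration, and only flags named in this section are used.

```python
import shlex


def build_sherlock_command(username: str, nsfw: bool = False, timeout: int = 90) -> list[str]:
    """Build the default Sherlock invocation as an argv list (hypothetical helper)."""
    # Assumption: refuse values that the CLI could mistake for an option.
    if not username or username.startswith("-"):
        raise ValueError("username must be non-empty and must not look like a flag")
    cmd = ["sherlock", "--print-found", "--no-color", username, "--timeout", str(timeout)]
    if nsfw:  # only when the user explicitly asked for it
        cmd.append("--nsfw")
    return cmd


cmd = build_sherlock_command("johndoe123")
print(shlex.join(cmd))  # shell-safe rendering of the argument list
```

Passing the list straight to `subprocess.run` keeps the username as one argument, exactly as the user typed it.
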

### 4. Execute Search

Run via the `terminal` tool. The command typically takes 30-120 seconds depending on network conditions and site count.

**Example terminal call:**
```json
{
  "command": "sherlock --print-found --no-color \"target_username\" --timeout 90",
  "timeout": 180
}
```

### 5. Parse and Present Results

Sherlock prints each found account on its own line. Parse the output and present:

1. **Summary line:** "Found X accounts for username 'Y'"
2. **Categorized links:** Group by platform type if helpful (social, professional, forums, etc.)
3. **Output file location:** Sherlock saves results to `<username>.txt` by default

**Example output parsing:**
```
[+] Instagram: https://instagram.com/username
[+] Twitter: https://twitter.com/username
[+] GitHub: https://github.com/username
```

Present findings as clickable links when possible.
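
The parsing step can be sketched as follows. The regular expression assumes the `[+] Site: URL` line shape shown in the example output above, and the function name `parse_found` is hypothetical. Lines that do not fit that shape are ignored.

```python
import re

# Matches lines such as "[+] GitHub: https://github.com/username".
FOUND_LINE = re.compile(r'^\[\+\]\s+(?P<site>[^:]+):\s+(?P<url>https?://\S+)\s*$')


def parse_found(output: str) -> dict[str, str]:
    """Map site name to profile URL for every '[+]' line in the output."""
    results = {}
    for line in output.splitlines():
        match = FOUND_LINE.match(line.strip())
        if match:
            results[match["site"].strip()] = match["url"]
    return results


sample = """\
[+] Instagram: https://instagram.com/username
[+] Twitter: https://twitter.com/username
[+] GitHub: https://github.com/username
"""
accounts = parse_found(sample)
print(f"Found {len(accounts)} accounts for username 'username'")
for site, url in accounts.items():
    print(f"- [{site}]({url})")  # Markdown link, clickable where rendered
```
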
|
|
100
|
+
|
|
101
|
+
## Pitfalls
|
|
102
|
+
|
|
103
|
+
### No Results Found
|
|
104
|
+
If Sherlock finds no accounts, this is often correct — the username may not be registered on checked platforms. Suggest:
|
|
105
|
+
- Checking spelling/variation
|
|
106
|
+
- Trying similar usernames with the `{?}` wildcard, which Sherlock expands to `_`, `-`, and `.`: `sherlock "user{?}name"`
|
|
107
|
+
- The user may have privacy settings or deleted accounts
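
Spelling variations can also be enumerated explicitly and passed to Sherlock as separate usernames. The sketch below assumes a username made of two parts joined by an optional separator; the helper name `username_variants` and the separator set are illustrative choices, not Sherlock features.

```bash
# Hypothetical helper: print separator variants of a two-part username.
username_variants() {
    for sep in "" _ - .; do
        printf '%s%s%s\n' "$1" "$sep" "$2"
    done
}

username_variants user name
```

Each printed line is one candidate username to try in a follow-up search.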
|
|
108
|
+
|
|
109
|
+
### Timeout Issues
|
|
110
|
+
Some sites are slow or block automated requests. Use `--timeout 120` to increase wait time, or `--site` to limit scope.
|
|
111
|
+
|
|
112
|
+
### Tor Configuration
|
|
113
|
+
`--tor` requires a running Tor daemon. If the user wants anonymity but Tor isn't available, suggest:
|
|
114
|
+
- Installing Tor service
|
|
115
|
+
- Using `--proxy` with an alternative proxy
|
|
116
|
+
|
|
117
|
+
### False Positives
|
|
118
|
+
Some sites always return "found" due to their response structure. Cross-reference unexpected results with manual checks.
|
|
119
|
+
|
|
120
|
+
### Rate Limiting
|
|
121
|
+
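Aggressive searches may trigger rate limits. For bulk username searches, add delays between calls or use `--site` to restrict each search to the platforms of interest. Note that `--local` only forces Sherlock to use its local `data.json` site list; it does not cache results or reduce the number of requests sent to the checked sites.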
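
Adding a delay between calls could be done with a wrapper like the one below. It is a sketch under stated assumptions: `bulk_search` and the `RUNNER` variable are hypothetical names, and `RUNNER` is set to `echo` so the example only prints the arguments it would pass instead of contacting any site.

```bash
# Illustrative variable, not a Sherlock option.
# Change to "sherlock" to run real searches; "echo" is a dry run.
RUNNER=echo

# Hypothetical wrapper: one search per username, pausing between calls.
# Usage: bulk_search <delay-seconds> <username>...
bulk_search() {
    delay=$1
    shift
    first=1
    for name in "$@"; do
        # Sleep before every call except the first.
        [ "$first" -eq 1 ] || sleep "$delay"
        first=0
        "$RUNNER" --print-found --no-color "$name" --timeout 90
    done
}

bulk_search 1 alice bob
```

A suitable delay depends on the sites being checked; the one-second value here is only a placeholder.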
|
|
122
|
+
|
|
123
|
+
## Installation
|
|
124
|
+
|
|
125
|
+
### pipx (recommended)
|
|
126
|
+
```bash
|
|
127
|
+
pipx install sherlock-project
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
### pip
|
|
131
|
+
```bash
|
|
132
|
+
pip install sherlock-project
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### Docker
|
|
136
|
+
```bash
|
|
137
|
+
docker pull sherlock/sherlock
|
|
138
|
+
docker run -it --rm sherlock/sherlock <username>
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
### Linux packages
|
|
142
|
+
Available on Debian 13+, Ubuntu 22.10+, Homebrew, Kali, BlackArch.
|
|
143
|
+
|
|
144
|
+
## Ethical Use
|
|
145
|
+
|
|
146
|
+
This tool is for legitimate OSINT and research purposes only. Remind users:
|
|
147
|
+
- Only search usernames they own or have permission to investigate
|
|
148
|
+
- Respect platform terms of service
|
|
149
|
+
- Do not use for harassment, stalking, or illegal activities
|
|
150
|
+
- Consider privacy implications before sharing results
|
|
151
|
+
|
|
152
|
+
## Verification
|
|
153
|
+
|
|
154
|
+
After running sherlock, verify:
|
|
155
|
+
1. Output lists found sites with URLs
|
|
156
|
+
2. A `<username>.txt` file has been created (the default when file output is enabled)
|
|
157
|
+
3. If `--print-found` is used, the per-site results should contain only `[+]` lines for matches
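
These checks could be automated roughly as follows. The function name `verify_run` is hypothetical, and the sketch assumes the captured output and the username are passed in as arguments; a missing `<username>.txt` is reported as a note, not a failure, because file output may not be in use.

```bash
# Hypothetical check: succeed if the output has at least one
# "[+] Site: URL" line; mention the results file if it is absent.
verify_run() {
    output=$1
    username=$2
    if ! printf '%s\n' "$output" | grep -q '^\[+] [^:]*: https\{0,1\}://'; then
        echo "no [+] lines with URLs"
        return 1
    fi
    [ -f "$username.txt" ] || echo "note: $username.txt not found in current directory"
    echo "ok"
}

verify_run '[+] GitHub: https://github.com/username' username
```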
|
|
158
|
+
|
|
159
|
+
## Example Interaction
|
|
160
|
+
|
|
161
|
+
**User:** "Can you check if the username 'johndoe123' exists on social media?"
|
|
162
|
+
|
|
163
|
+
**Agent procedure:**
|
|
164
|
+
1. Check `sherlock --version` (verify installed)
|
|
165
|
+
2. Username provided — proceed directly
|
|
166
|
+
3. Run: `sherlock --print-found --no-color "johndoe123" --timeout 90`
|
|
167
|
+
4. Parse output and present links
|
|
168
|
+
|
|
169
|
+
**Response format:**
|
|
170
|
+
> Found 12 accounts for username 'johndoe123':
|
|
171
|
+
>
|
|
172
|
+
> • https://twitter.com/johndoe123
|
|
173
|
+
> • https://github.com/johndoe123
|
|
174
|
+
> • https://instagram.com/johndoe123
|
|
175
|
+
> • [... additional links]
|
|
176
|
+
>
|
|
177
|
+
> Results saved to: johndoe123.txt
|
|
178
|
+
|
|
179
|
+
---
|
|
180
|
+
|
|
181
|
+
**User:** "Search for username 'alice' including NSFW sites"
|
|
182
|
+
|
|
183
|
+
**Agent procedure:**
|
|
184
|
+
1. Check sherlock installed
|
|
185
|
+
2. Username + NSFW flag both provided
|
|
186
|
+
3. Run: `sherlock --print-found --no-color --nsfw "alice" --timeout 90`
|
|
187
|
+
4. Present results
|
|
188
|
+
|
|
189
|
+
## Mandatory actions when this skill is active
|
|
190
|
+
|
|
191
|
+
Before applying this skill:
|
|
192
|
+
- [ ] Read the task requirements fully before acting
|
|
193
|
+
- [ ] Confirm you understand the goal and constraints
|
|
194
|
+
- [ ] Check for existing work or prior context in the codebase
|
|
195
|
+
|
|
196
|
+
While working:
|
|
197
|
+
- [ ] Follow the methodology described above step by step
|
|
198
|
+
- [ ] Document any decisions or findings as you go
|
|
199
|
+
|
|
200
|
+
After completing:
|
|
201
|
+
- [ ] Self-check: does the output satisfy the original requirement?
|
|
202
|
+
- [ ] Verify no regressions or unintended side effects
|
|
203
|
+
|