ultimate-pi 0.1.0 → 0.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (509) hide show
  1. package/.agents/skills/ck-search/SKILL.md +99 -0
  2. package/.agents/skills/defuddle/SKILL.md +90 -0
  3. package/.agents/skills/find-skills/SKILL.md +142 -0
  4. package/.agents/skills/firecrawl/SKILL.md +150 -0
  5. package/.agents/skills/firecrawl/rules/install.md +82 -0
  6. package/.agents/skills/firecrawl/rules/security.md +26 -0
  7. package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
  8. package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
  9. package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
  10. package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
  11. package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
  12. package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
  13. package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
  14. package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
  15. package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
  16. package/.agents/skills/firecrawl-download/SKILL.md +69 -0
  17. package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
  18. package/.agents/skills/firecrawl-map/SKILL.md +50 -0
  19. package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
  20. package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
  21. package/.agents/skills/firecrawl-search/SKILL.md +59 -0
  22. package/.agents/skills/obsidian-bases/SKILL.md +299 -0
  23. package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
  24. package/.agents/skills/posthog-analyst/SKILL.md +306 -0
  25. package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
  26. package/.agents/skills/wiki/SKILL.md +215 -0
  27. package/.agents/skills/wiki/references/css-snippets.md +122 -0
  28. package/.agents/skills/wiki/references/frontmatter.md +107 -0
  29. package/.agents/skills/wiki/references/git-setup.md +58 -0
  30. package/.agents/skills/wiki/references/mcp-setup.md +149 -0
  31. package/.agents/skills/wiki/references/modes.md +259 -0
  32. package/.agents/skills/wiki/references/plugins.md +96 -0
  33. package/.agents/skills/wiki/references/rest-api.md +124 -0
  34. package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
  35. package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
  36. package/.agents/skills/wiki-fold/SKILL.md +204 -0
  37. package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
  38. package/.agents/skills/wiki-ingest/SKILL.md +288 -0
  39. package/.agents/skills/wiki-lint/SKILL.md +183 -0
  40. package/.agents/skills/wiki-query/SKILL.md +176 -0
  41. package/.agents/skills/wiki-save/SKILL.md +128 -0
  42. package/.ckignore +41 -0
  43. package/.env.example +9 -0
  44. package/.github/banner-v2.png +0 -0
  45. package/.github/workflows/lint.yml +33 -0
  46. package/.github/workflows/publish-github-packages.yml +35 -0
  47. package/.github/workflows/publish-npm.yml +32 -0
  48. package/.pi/SYSTEM.md +107 -40
  49. package/.pi/agents/pi-pi/agent-expert.md +205 -0
  50. package/.pi/agents/pi-pi/cli-expert.md +47 -0
  51. package/.pi/agents/pi-pi/config-expert.md +67 -0
  52. package/.pi/agents/pi-pi/ext-expert.md +53 -0
  53. package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
  54. package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
  55. package/.pi/agents/pi-pi/prompt-expert.md +83 -0
  56. package/.pi/agents/pi-pi/skill-expert.md +52 -0
  57. package/.pi/agents/pi-pi/theme-expert.md +46 -0
  58. package/.pi/agents/pi-pi/tui-expert.md +100 -0
  59. package/.pi/agents/rethink.md +140 -0
  60. package/.pi/agents/wiki-ingest.md +67 -0
  61. package/.pi/agents/wiki-lint.md +75 -0
  62. package/.pi/auto-commit.json +20 -0
  63. package/.pi/extensions/banner.png +0 -0
  64. package/.pi/extensions/ck-enforce.ts +216 -0
  65. package/.pi/extensions/custom-footer.ts +308 -0
  66. package/.pi/extensions/custom-header.ts +116 -0
  67. package/.pi/extensions/dotenv-loader.ts +170 -0
  68. package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
  69. package/.pi/model-router.json +95 -0
  70. package/.pi/npm/.gitignore +2 -0
  71. package/.pi/prompts/git-sync.md +124 -0
  72. package/.pi/prompts/harness-setup.md +509 -0
  73. package/.pi/prompts/save.md +16 -0
  74. package/.pi/prompts/wiki-autoresearch.md +19 -0
  75. package/.pi/prompts/wiki.md +23 -0
  76. package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
  77. package/.pi/providers/cursor-sdk-provider.ts +1085 -0
  78. package/.pi/settings.json +14 -4
  79. package/.pi/skills/agent-router/SKILL.md +174 -0
  80. package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
  81. package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
  82. package/.pi/sounds/error/2-smash-track.mp3 +0 -0
  83. package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
  84. package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
  85. package/.pi/sounds/project-sounds.json +25 -0
  86. package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
  87. package/.pi/sounds/success/1-tada-track.mp3 +0 -0
  88. package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
  89. package/.pi/sounds/success/3-yay-track.mp3 +0 -0
  90. package/CONTRIBUTING.md +116 -0
  91. package/README.md +33 -40
  92. package/biome.json +34 -0
  93. package/firecrawl/.env.template +58 -0
  94. package/firecrawl/README.md +49 -0
  95. package/firecrawl/docker-compose.yaml +201 -0
  96. package/firecrawl/searxng/searxng.env +3 -0
  97. package/firecrawl/searxng/settings.yml +85 -0
  98. package/lefthook.yml +8 -0
  99. package/package.json +55 -16
  100. package/vault/AGENTS.md +37 -0
  101. package/vault/wiki/_templates/comparison.md +39 -0
  102. package/vault/wiki/_templates/concept.md +40 -0
  103. package/vault/wiki/_templates/decision.md +21 -0
  104. package/vault/wiki/_templates/entity.md +32 -0
  105. package/vault/wiki/_templates/flow.md +14 -0
  106. package/vault/wiki/_templates/module.md +18 -0
  107. package/vault/wiki/_templates/question.md +31 -0
  108. package/vault/wiki/_templates/source.md +39 -0
  109. package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
  110. package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
  111. package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
  112. package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
  113. package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
  114. package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
  115. package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
  116. package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
  117. package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
  118. package/vault/wiki/concepts/Meta-Harness.md +16 -0
  119. package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
  120. package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
  121. package/vault/wiki/concepts/Prompt Renderer.md +89 -0
  122. package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
  123. package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
  124. package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
  125. package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
  126. package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
  127. package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
  128. package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
  129. package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
  130. package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
  131. package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
  132. package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
  133. package/vault/wiki/concepts/agentic-harness.md +34 -0
  134. package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
  135. package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
  136. package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
  137. package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
  138. package/vault/wiki/concepts/ast-compression.md +19 -0
  139. package/vault/wiki/concepts/ast-truncation.md +66 -0
  140. package/vault/wiki/concepts/barrel-files.md +37 -0
  141. package/vault/wiki/concepts/browser-harness-agent.md +41 -0
  142. package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
  143. package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
  144. package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
  145. package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
  146. package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
  147. package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
  148. package/vault/wiki/concepts/consensus-debate.md +206 -0
  149. package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
  150. package/vault/wiki/concepts/context-anxiety.md +57 -0
  151. package/vault/wiki/concepts/context-compression-techniques.md +19 -0
  152. package/vault/wiki/concepts/context-continuity.md +22 -0
  153. package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
  154. package/vault/wiki/concepts/context-engineering.md +62 -0
  155. package/vault/wiki/concepts/context-folding.md +67 -0
  156. package/vault/wiki/concepts/context-mode.md +38 -0
  157. package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
  158. package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
  159. package/vault/wiki/concepts/drift-detection-unified.md +296 -0
  160. package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
  161. package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
  162. package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
  163. package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
  164. package/vault/wiki/concepts/fts5-sandbox.md +19 -0
  165. package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
  166. package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
  167. package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
  168. package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
  169. package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
  170. package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
  171. package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
  172. package/vault/wiki/concepts/harness-h-formalism.md +53 -0
  173. package/vault/wiki/concepts/hybrid-code-search.md +61 -0
  174. package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
  175. package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
  176. package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
  177. package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
  178. package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
  179. package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
  180. package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
  181. package/vault/wiki/concepts/model-routing-agents.md +101 -0
  182. package/vault/wiki/concepts/monorepo-architecture.md +45 -0
  183. package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
  184. package/vault/wiki/concepts/permission-subsystem.md +16 -0
  185. package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
  186. package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
  187. package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
  188. package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
  189. package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
  190. package/vault/wiki/concepts/provider-native-prompting.md +203 -0
  191. package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
  192. package/vault/wiki/concepts/repo-map-ranking.md +42 -0
  193. package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
  194. package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
  195. package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
  196. package/vault/wiki/concepts/selective-debate-routing.md +70 -0
  197. package/vault/wiki/concepts/self-evolving-harness.md +60 -0
  198. package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
  199. package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
  200. package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
  201. package/vault/wiki/concepts/skill-first-architecture.md +166 -0
  202. package/vault/wiki/concepts/structured-compaction.md +78 -0
  203. package/vault/wiki/concepts/subagent-orchestration.md +17 -0
  204. package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
  205. package/vault/wiki/concepts/superpowers-methodology.md +78 -0
  206. package/vault/wiki/concepts/think-in-code.md +73 -0
  207. package/vault/wiki/concepts/ts-execution-layer.md +100 -0
  208. package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
  209. package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
  210. package/vault/wiki/concepts/verification-drift-detection.md +19 -0
  211. package/vault/wiki/consensus/consensus-records.md +58 -0
  212. package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
  213. package/vault/wiki/decisions/adr-008.md +40 -0
  214. package/vault/wiki/decisions/adr-009.md +46 -0
  215. package/vault/wiki/decisions/adr-010.md +55 -0
  216. package/vault/wiki/decisions/adr-011.md +165 -0
  217. package/vault/wiki/decisions/adr-012.md +102 -0
  218. package/vault/wiki/decisions/adr-013.md +59 -0
  219. package/vault/wiki/decisions/adr-014.md +73 -0
  220. package/vault/wiki/decisions/adr-015.md +81 -0
  221. package/vault/wiki/decisions/adr-016.md +91 -0
  222. package/vault/wiki/decisions/adr-017.md +79 -0
  223. package/vault/wiki/decisions/adr-018.md +100 -0
  224. package/vault/wiki/decisions/adr-019.md +75 -0
  225. package/vault/wiki/decisions/adr-020.md +106 -0
  226. package/vault/wiki/decisions/adr-021.md +86 -0
  227. package/vault/wiki/decisions/adr-022.md +113 -0
  228. package/vault/wiki/decisions/adr-023.md +113 -0
  229. package/vault/wiki/decisions/adr-024.md +73 -0
  230. package/vault/wiki/decisions/adr-025.md +130 -0
  231. package/vault/wiki/decisions/adr-026.md +56 -0
  232. package/vault/wiki/decisions/colocate-wiki.md +34 -0
  233. package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
  234. package/vault/wiki/entities/Anthropic.md +17 -0
  235. package/vault/wiki/entities/Augment Code.md +49 -0
  236. package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
  237. package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
  238. package/vault/wiki/entities/Boris Cherny.md +11 -0
  239. package/vault/wiki/entities/Claude Code.md +19 -0
  240. package/vault/wiki/entities/Dennis Ritchie.md +26 -0
  241. package/vault/wiki/entities/Emergent Labs.md +32 -0
  242. package/vault/wiki/entities/Google Cloud.md +16 -0
  243. package/vault/wiki/entities/Guido van Rossum.md +28 -0
  244. package/vault/wiki/entities/Ken Thompson.md +28 -0
  245. package/vault/wiki/entities/Lee et al.md +16 -0
  246. package/vault/wiki/entities/Linus Torvalds.md +28 -0
  247. package/vault/wiki/entities/Lovable (company).md +40 -0
  248. package/vault/wiki/entities/Martin Fowler.md +16 -0
  249. package/vault/wiki/entities/Meng et al.md +16 -0
  250. package/vault/wiki/entities/OpenAI.md +16 -0
  251. package/vault/wiki/entities/Rocket.new.md +38 -0
  252. package/vault/wiki/entities/VILA-Lab.md +15 -0
  253. package/vault/wiki/entities/autodev-codebase.md +18 -0
  254. package/vault/wiki/entities/ck-tool.md +59 -0
  255. package/vault/wiki/entities/codesearch.md +18 -0
  256. package/vault/wiki/entities/disler-indydevdan.md +33 -0
  257. package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
  258. package/vault/wiki/entities/javascript-runtimes.md +48 -0
  259. package/vault/wiki/entities/jesse-vincent.md +38 -0
  260. package/vault/wiki/entities/lean-ctx.md +32 -0
  261. package/vault/wiki/entities/opendev.md +41 -0
  262. package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
  263. package/vault/wiki/entities/pi-coding-agent.md +53 -0
  264. package/vault/wiki/entities/sentrux.md +54 -0
  265. package/vault/wiki/entities/vgrep-tool.md +57 -0
  266. package/vault/wiki/entities/vitest.md +41 -0
  267. package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
  268. package/vault/wiki/hot.md +932 -0
  269. package/vault/wiki/index.md +437 -0
  270. package/vault/wiki/log.md +418 -0
  271. package/vault/wiki/meta/dashboard.md +30 -0
  272. package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
  273. package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
  274. package/vault/wiki/meta/overview.canvas +43 -0
  275. package/vault/wiki/modules/adversarial-verification.md +57 -0
  276. package/vault/wiki/modules/automated-observability.md +54 -0
  277. package/vault/wiki/modules/bench.md +20 -0
  278. package/vault/wiki/modules/extensions.md +23 -0
  279. package/vault/wiki/modules/grounding-checkpoints.md +62 -0
  280. package/vault/wiki/modules/harness-implementation-plan.md +345 -0
  281. package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
  282. package/vault/wiki/modules/harness.md +86 -0
  283. package/vault/wiki/modules/persistent-memory.md +85 -0
  284. package/vault/wiki/modules/schema-orchestration.md +68 -0
  285. package/vault/wiki/modules/skills.md +27 -0
  286. package/vault/wiki/modules/spec-hardening.md +58 -0
  287. package/vault/wiki/modules/structured-planning.md +53 -0
  288. package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
  289. package/vault/wiki/modules/wiki-query-interface.md +64 -0
  290. package/vault/wiki/overview.md +51 -0
  291. package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
  292. package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
  293. package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
  294. package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
  295. package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
  296. package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
  297. package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
  298. package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
  299. package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
  300. package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
  301. package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
  302. package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
  303. package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
  304. package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
  305. package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
  306. package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
  307. package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
  308. package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
  309. package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
  310. package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
  311. package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
  312. package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
  313. package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
  314. package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
  315. package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
  316. package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
  317. package/vault/wiki/questions/Research: pi-vcc.md +113 -0
  318. package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
  319. package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
  320. package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
  321. package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
  322. package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
  323. package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
  324. package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
  325. package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
  326. package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
  327. package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
  328. package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
  329. package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
  330. package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
  331. package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
  332. package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
  333. package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
  334. package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
  335. package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
  336. package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
  337. package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
  338. package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
  339. package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
  340. package/vault/wiki/sources/Source: Arxiv /342/200/224 Don't Break the Cache.md" +85 -0
  341. package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
  342. package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
  343. package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
  344. package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
  345. package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
  346. package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
  347. package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
  348. package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
  349. package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
  350. package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
  351. package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
  352. package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
  353. package/vault/wiki/sources/Source: OpenAI Harness Engineering /342/200/224 0 Lines of Human Code.md" +101 -0
  354. package/vault/wiki/sources/Source: OpenDev /342/200/224 Building AI Coding Agents for the Terminal.md" +100 -0
  355. package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
  356. package/vault/wiki/sources/Source: Rocket.new /342/200/224 Vibe Solutioning Platform.md" +70 -0
  357. package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
  358. package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
  359. package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
  360. package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
  361. package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
  362. package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
  363. package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
  364. package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
  365. package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
  366. package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
  367. package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
  368. package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
  369. package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
  370. package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
  371. package/vault/wiki/sources/ck-semantic-search.md +78 -0
  372. package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
  373. package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
  374. package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
  375. package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
  376. package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
  377. package/vault/wiki/sources/cloudflare-codemode.md +63 -0
  378. package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
  379. package/vault/wiki/sources/codeact-apple-2024.md +62 -0
  380. package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
  381. package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
  382. package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
  383. package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
  384. package/vault/wiki/sources/context-folding-paper.md +61 -0
  385. package/vault/wiki/sources/context-mode-website.md +63 -0
  386. package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
  387. package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
  388. package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
  389. package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
  390. package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
  391. package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
  392. package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
  393. package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
  394. package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
  395. package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
  396. package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
  397. package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
  398. package/vault/wiki/sources/fan2025-imad.md +60 -0
  399. package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
  400. package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
  401. package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
  402. package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
  403. package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
  404. package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
  405. package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
  406. package/vault/wiki/sources/gitingest.md +91 -0
  407. package/vault/wiki/sources/gitreverse.md +63 -0
  408. package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
  409. package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
  410. package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
  411. package/vault/wiki/sources/gsd-github-repo.md +51 -0
  412. package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
  413. package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
  414. package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
  415. package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
  416. package/vault/wiki/sources/langsight-loop-detection.md +80 -0
  417. package/vault/wiki/sources/leanctx-website.md +69 -0
  418. package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
  419. package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
  420. package/vault/wiki/sources/lou2026-autoharness.md +53 -0
  421. package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
  422. package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
  423. package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
  424. package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
  425. package/vault/wiki/sources/ms-chat-history-management.md +13 -0
  426. package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
  427. package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
  428. package/vault/wiki/sources/opencode-dcp.md +13 -0
  429. package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
  430. package/vault/wiki/sources/openhands-platform.md +39 -0
  431. package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
  432. package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
  433. package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
  434. package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
  435. package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
  436. package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
  437. package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
  438. package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
  439. package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
  440. package/vault/wiki/sources/py-tree-sitter.md +13 -0
  441. package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
  442. package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
  443. package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
  444. package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
  445. package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
  446. package/vault/wiki/sources/sentrux-github-repo.md +56 -0
  447. package/vault/wiki/sources/superpowers-github-repo.md +56 -0
  448. package/vault/wiki/sources/superpowers-release-blog.md +54 -0
  449. package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
  450. package/vault/wiki/sources/swe-agent-aci.md +42 -0
  451. package/vault/wiki/sources/swe-bench.md +45 -0
  452. package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
  453. package/vault/wiki/sources/think-in-code-blog.md +48 -0
  454. package/vault/wiki/sources/tree-sitter-docs.md +13 -0
  455. package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
  456. package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
  457. package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
  458. package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
  459. package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
  460. package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
  461. package/vault/wiki/sources/unix-philosophy.md +48 -0
  462. package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
  463. package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
  464. package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
  465. package/vault/wiki/sources/vitest-official.md +41 -0
  466. package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
  467. package/vault/wiki/sources/wozcode.md +79 -0
  468. package/.agents/skills/compress/SKILL.md +0 -111
  469. package/.agents/skills/compress/scripts/__init__.py +0 -9
  470. package/.agents/skills/compress/scripts/__main__.py +0 -3
  471. package/.agents/skills/compress/scripts/benchmark.py +0 -78
  472. package/.agents/skills/compress/scripts/cli.py +0 -73
  473. package/.agents/skills/compress/scripts/compress.py +0 -227
  474. package/.agents/skills/compress/scripts/detect.py +0 -121
  475. package/.agents/skills/compress/scripts/validate.py +0 -189
  476. package/.agents/skills/emil-design-eng/SKILL.md +0 -679
  477. package/.agents/skills/lean-ctx/SKILL.md +0 -149
  478. package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
  479. package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
  480. package/.agents/skills/scrapling-official/SKILL.md +0 -390
  481. package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
  482. package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
  483. package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
  484. package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
  485. package/.agents/skills/scrapling-official/examples/README.md +0 -45
  486. package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
  487. package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
  488. package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
  489. package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
  490. package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
  491. package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
  492. package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
  493. package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
  494. package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
  495. package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
  496. package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
  497. package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
  498. package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
  499. package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
  500. package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
  501. package/.github/banner.png +0 -0
  502. package/PLAN.md +0 -11
  503. package/extensions/lean-ctx-enforce.ts +0 -166
  504. package/skills-lock.json +0 -35
  505. package/wiki/README.md +0 -10
  506. package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
  507. package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
  508. package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
  509. package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
@@ -1,255 +0,0 @@
1
- # StealthyFetcher
2
-
3
- `StealthyFetcher` is a stealthy browser-based fetcher similar to [DynamicFetcher](dynamic.md), using [Playwright's API](https://playwright.dev/python/docs/intro). It adds advanced anti-bot protection bypass capabilities, most handled automatically. It shares the same browser automation model as `DynamicFetcher`, using [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) for page interaction.
4
-
5
- ## Basic Usage
6
- You have one primary way to import this Fetcher, which is the same for all fetchers.
7
-
8
- ```python
9
- >>> from scrapling.fetchers import StealthyFetcher
10
- ```
11
- Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
12
-
13
- **Note:** The async version of the `fetch` method is `async_fetch`.
14
-
15
- ## What does it do?
16
-
17
- The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynamic.md) class, and here are some of the things it does:
18
-
19
- 1. It easily bypasses all types of Cloudflare's Turnstile/Interstitial automatically.
20
- 2. It bypasses CDP runtime leaks and WebRTC leaks.
21
- 3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
22
- 4. It generates canvas noise to prevent fingerprinting through canvas.
23
- 5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
24
- 6. and other anti-protection options...
25
-
26
- ## Full list of arguments
27
- Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
28
-
29
-
30
- | Argument | Description | Optional |
31
- |:-------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
32
- | url | Target url | ❌ |
33
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
34
- | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
35
- | cookies | Set cookies for the next request. | ✔️ |
36
- | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and version.** | ✔️ |
37
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
38
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
39
- | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
40
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
41
- | page_action | Added for automation. Pass a function that takes the `page` object, runs after navigation, and does the necessary automation. | ✔️ |
42
- | page_setup | A function that takes the `page` object, runs before navigation. Use it to register event listeners or routes that must be set up before the page loads. | ✔️ |
43
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
44
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
45
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
46
- | google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
47
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
48
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
49
- | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
50
- | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
51
- | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
52
- | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
53
- | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
54
- | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
55
- | solve_cloudflare | When enabled, fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
56
- | block_webrtc | Forces WebRTC to respect proxy settings to prevent local IP address leak. | ✔️ |
57
- | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
58
- | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
59
- | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
60
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
61
- | blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
62
- | block_ads | Block requests to ~3,500 known ad/tracking domains. Can be combined with `blocked_domains`. | ✔️ |
63
- | dns_over_https | Route DNS queries through Cloudflare's DNS-over-HTTPS to prevent DNS leaks when using proxies. | ✔️ |
64
- | proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
65
- | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
66
- | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
67
- | capture_xhr | Pass a regex URL pattern string to capture XHR/fetch requests matching it during page load. Captured responses are available via `response.captured_xhr`. Defaults to `None` (disabled). | ✔️ |
68
- | executable_path | Absolute path to a custom browser executable to use instead of the bundled Chromium. Useful for non-standard installations or custom browser builds. | ✔️ |
69
-
70
- In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `page_setup`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, `blocked_domains`, `proxy`, and `selector_config`.
71
-
72
- **Notes:**
73
-
74
- 1. It's basically the same arguments as [DynamicFetcher](dynamic.md) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
75
- 2. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
76
- 3. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
77
- 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
78
-
79
- ## Examples
80
-
81
- ### Cloudflare and stealth options
82
-
83
- ```python
84
- # Automatic Cloudflare solver
85
- page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)
86
-
87
- # Works with other stealth options
88
- page = StealthyFetcher.fetch(
89
- 'https://protected-site.com',
90
- solve_cloudflare=True,
91
- block_webrtc=True,
92
- real_chrome=True,
93
- hide_canvas=True,
94
- google_search=True,
95
- proxy='http://username:password@host:port', # It can also be a dictionary with only the keys 'server', 'username', and 'password'.
96
- )
97
- ```
98
-
99
- The `solve_cloudflare` parameter enables automatic detection and solving all types of Cloudflare's Turnstile/Interstitial challenges:
100
-
101
- - JavaScript challenges (managed)
102
- - Interactive challenges (clicking verification boxes)
103
- - Invisible challenges (automatic background verification)
104
-
105
- And even solves the custom pages with embedded captcha.
106
-
107
- **Important notes:**
108
-
109
- 1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible.
110
- 2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time.
111
- 3. This feature works seamlessly with proxies and other stealth options.
112
-
113
- ### Browser Automation
114
- This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
115
-
116
- This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.
117
-
118
- In the example below, I used the pages' [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
119
- ```python
120
- from playwright.sync_api import Page
121
-
122
- def scroll_page(page: Page):
123
- page.mouse.wheel(10, 0)
124
- page.mouse.move(100, 400)
125
- page.mouse.up()
126
-
127
- page = StealthyFetcher.fetch('https://example.com', page_action=scroll_page)
128
- ```
129
- Of course, if you use the async fetch version, the function must also be async.
130
- ```python
131
- from playwright.async_api import Page
132
-
133
- async def scroll_page(page: Page):
134
- await page.mouse.wheel(10, 0)
135
- await page.mouse.move(100, 400)
136
- await page.mouse.up()
137
-
138
- page = await StealthyFetcher.async_fetch('https://example.com', page_action=scroll_page)
139
- ```
140
-
141
- ### Wait Conditions
142
- ```python
143
- # Wait for the selector
144
- page = StealthyFetcher.fetch(
145
- 'https://example.com',
146
- wait_selector='h1',
147
- wait_selector_state='visible'
148
- )
149
- ```
150
- This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
151
-
152
- After that, if `load_dom` is enabled (the default), the fetcher will check again to see if all JavaScript files are loaded and executed (in the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
153
-
154
- The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
155
-
156
- - `attached`: Wait for an element to be present in the DOM.
157
- - `detached`: Wait for an element to not be present in the DOM.
158
- - `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
159
- - `hidden`: wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is opposite to the `'visible'` option.
160
-
161
-
162
- ### Real-world example (Amazon)
163
- This is for educational purposes only; this example was generated by AI, which also shows how easy it is to work with Scrapling through AI
164
- ```python
165
- def scrape_amazon_product(url):
166
- # Use StealthyFetcher to bypass protection
167
- page = StealthyFetcher.fetch(url)
168
-
169
- # Extract product details
170
- return {
171
- 'title': page.css('#productTitle::text').get().clean(),
172
- 'price': page.css('.a-price .a-offscreen::text').get(),
173
- 'rating': page.css('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text').get(),
174
- 'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
175
- 'features': [
176
- li.get().clean() for li in page.css('#feature-bullets li span::text')
177
- ],
178
- 'availability': page.css('#availability')[0].get_all_text(strip=True),
179
- 'images': [
180
- img.attrib['src'] for img in page.css('#altImages img')
181
- ]
182
- }
183
- ```
184
-
185
- ## Session Management
186
-
187
- To keep the browser open until you make multiple requests with the same configuration, use `StealthySession`/`AsyncStealthySession` classes. Those classes can accept all the arguments that the `fetch` function can take, which enables you to specify a config for the entire session.
188
-
189
- ```python
190
- from scrapling.fetchers import StealthySession
191
-
192
- # Create a session with default configuration
193
- with StealthySession(
194
- headless=True,
195
- real_chrome=True,
196
- block_webrtc=True,
197
- solve_cloudflare=True
198
- ) as session:
199
- # Make multiple requests with the same browser instance
200
- page1 = session.fetch('https://example1.com')
201
- page2 = session.fetch('https://example2.com')
202
- page3 = session.fetch('https://nopecha.com/demo/cloudflare')
203
-
204
- # All requests reuse the same tab on the same browser instance
205
- ```
206
-
207
- ### Async Session Usage
208
-
209
- ```python
210
- import asyncio
211
- from scrapling.fetchers import AsyncStealthySession
212
-
213
- async def scrape_multiple_sites():
214
- async with AsyncStealthySession(
215
- real_chrome=True,
216
- block_webrtc=True,
217
- solve_cloudflare=True,
218
- timeout=60000, # 60 seconds for Cloudflare challenges
219
- max_pages=3
220
- ) as session:
221
- # Make async requests with shared browser configuration
222
- pages = await asyncio.gather(
223
- session.fetch('https://site1.com'),
224
- session.fetch('https://site2.com'),
225
- session.fetch('https://protected-site.com')
226
- )
227
- return pages
228
- ```
229
-
230
- You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages that can be displayed at once. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
231
-
232
- 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
233
- 2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
234
-
235
- This logic allows for multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
236
-
237
- In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used in the request before this one.
238
-
239
- ### Session Benefits
240
-
241
- - **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
242
- - **Cookie persistence**: Automatic cookie and session state handling as any browser does automatically.
243
- - **Consistent fingerprint**: Same browser fingerprint across all requests.
244
- - **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
245
-
246
- ## When to Use
247
-
248
- Use StealthyFetcher when:
249
-
250
- - Bypassing anti-bot protection
251
- - Need a reliable browser fingerprint
252
- - Full JavaScript support needed
253
- - Want automatic stealth features
254
- - Need browser automation
255
- - Dealing with Cloudflare protection
@@ -1,214 +0,0 @@
1
- # Scrapling MCP Server
2
-
3
- The Scrapling MCP server exposes ten tools over the MCP protocol. It supports CSS-selector-based content narrowing (reducing tokens by extracting only relevant elements before returning results), three levels of scraping capability (plain HTTP, browser-rendered, and stealth/anti-bot bypass), persistent browser session management, and page screenshots returned as real image content blocks.
4
-
5
- All scraping tools return a `ResponseModel` with fields: `status` (int), `content` (list of strings), `url` (str). The `screenshot` tool returns a list of MCP content blocks: an `ImageContent` (the screenshot bytes) followed by a `TextContent` (the post-redirect URL).
6
-
7
- ## Tools
8
-
9
- ### `get` -- HTTP request (single URL)
10
-
11
- Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable for static pages with no/low bot protection.
12
-
13
- **Key parameters:**
14
-
15
- | Parameter | Type | Default | Description |
16
- |---------------------|------------------------------------|--------------|--------------------------------------------------------------------|
17
- | `url` | str | required | URL to fetch |
18
- | `extraction_type` | `"markdown"` / `"html"` / `"text"` | `"markdown"` | Output format |
19
- | `css_selector` | str or null | null | CSS selector to narrow content (applied after `main_content_only`) |
20
- | `main_content_only` | bool | true | Restrict to `<body>` content |
21
- | `impersonate` | str | `"chrome"` | Browser fingerprint to impersonate |
22
- | `proxy` | str or null | null | Proxy URL, e.g. `"http://user:pass@host:port"` |
23
- | `proxy_auth` | dict or null | null | `{"username": "...", "password": "..."}` |
24
- | `auth` | dict or null | null | HTTP basic auth, same format as proxy_auth |
25
- | `timeout` | number | 30 | Seconds before timeout |
26
- | `retries` | int | 3 | Retry attempts on failure |
27
- | `retry_delay` | int | 1 | Seconds between retries |
28
- | `stealthy_headers` | bool | true | Generate realistic browser headers and Google referer |
29
- | `http3` | bool | false | Use HTTP/3 (may conflict with `impersonate`) |
30
- | `follow_redirects` | bool or "safe" | "safe" | Follow redirects. "safe" rejects redirects to internal/private IPs |
31
- | `max_redirects` | int | 30 | Max redirects (-1 for unlimited) |
32
- | `headers` | dict or null | null | Custom request headers |
33
- | `cookies` | dict or null | null | Request cookies |
34
- | `params` | dict or null | null | Query string parameters |
35
- | `verify` | bool | true | Verify HTTPS certificates |
36
-
37
- ### `bulk_get` -- HTTP request (multiple URLs)
38
-
39
- Async concurrent version of `get`. Same parameters except `url` is replaced by `urls` (list of strings). All URLs are fetched in parallel. Returns a list of `ResponseModel`.
40
-
41
- ### `fetch` -- Browser fetch (single URL)
42
-
43
- Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynamic/SPA pages with no/low bot protection.
44
-
45
- **Key parameters (beyond shared ones):**
46
-
47
- | Parameter | Type | Default | Description |
48
- |-----------------------|---------------------|--------------|---------------------------------------------------------------------------------|
49
- | `url` | str | required | URL to fetch |
50
- | `extraction_type` | str | `"markdown"` | `"markdown"` / `"html"` / `"text"` |
51
- | `css_selector` | str or null | null | Narrow content before extraction |
52
- | `main_content_only` | bool | true | Restrict to `<body>` |
53
- | `headless` | bool | true | Run browser hidden (true) or visible (false) |
54
- | `proxy` | str or dict or null | null | String URL or `{"server": "...", "username": "...", "password": "..."}` |
55
- | `timeout` | number | 30000 | Timeout in **milliseconds** |
56
- | `wait` | number | 0 | Extra wait (ms) after page load before extraction |
57
- | `wait_selector` | str or null | null | CSS selector to wait for before extraction |
58
- | `wait_selector_state` | str | `"attached"` | State for wait_selector: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
59
- | `network_idle` | bool | false | Wait until no network activity for 500ms |
60
- | `disable_resources` | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
61
- | `google_search` | bool | true | Set a Google referer header |
62
- | `real_chrome` | bool | false | Use locally installed Chrome instead of bundled Chromium |
63
- | `cdp_url` | str or null | null | Connect to existing browser via CDP URL |
64
- | `extra_headers` | dict or null | null | Additional request headers |
65
- | `useragent` | str or null | null | Custom user-agent (auto-generated if null) |
66
- | `cookies` | list or null | null | Playwright-format cookies |
67
- | `timezone_id` | str or null | null | Browser timezone, e.g. `"America/New_York"` |
68
- | `locale` | str or null | null | Browser locale, e.g. `"en-GB"` |
69
- | `session_id` | str or null | null | Reuse a persistent session from `open_session` instead of creating a new browser |
70
-
71
- ### `bulk_fetch` -- Browser fetch (multiple URLs)
72
-
73
- Concurrent browser version of `fetch`. Same parameters (including `session_id`) except `url` is replaced by `urls` (list of strings). Each URL opens in a separate browser tab. Returns a list of `ResponseModel`.
74
-
75
- ### `stealthy_fetch` -- Stealth browser fetch (single URL)
76
-
77
- Anti-bot bypass fetcher with fingerprint spoofing. Use this for sites with Cloudflare Turnstile/Interstitial or other strong protections.
78
-
79
- **Additional parameters (beyond those in `fetch`):**
80
-
81
- | Parameter | Type | Default | Description |
82
- |--------------------|--------------|---------|------------------------------------------------------------------|
83
- | `solve_cloudflare` | bool | false | Automatically solve Cloudflare Turnstile/Interstitial challenges |
84
- | `hide_canvas` | bool | false | Add noise to canvas operations to prevent fingerprinting |
85
- | `block_webrtc` | bool | false | Force WebRTC to respect proxy settings (prevents IP leak) |
86
- | `allow_webgl` | bool | true | Keep WebGL enabled (disabling is detectable by WAFs) |
87
- | `additional_args` | dict or null | null | Extra Playwright context args (overrides Scrapling defaults) |
88
- | `session_id` | str or null | null | Reuse a persistent stealthy session from `open_session` |
89
-
90
- All parameters from `fetch` are also accepted.
91
-
92
- ### `bulk_stealthy_fetch` -- Stealth browser fetch (multiple URLs)
93
-
94
- Concurrent stealth version. Same parameters (including `session_id`) as `stealthy_fetch` except `url` is replaced by `urls` (list of strings). Returns a list of `ResponseModel`.
95
-
96
- ### `open_session` -- Create a persistent browser session
97
-
98
- Opens a browser session that stays alive across multiple fetch calls, avoiding the overhead of launching a new browser each time. Returns a `SessionCreatedModel` with `session_id`, `session_type`, `created_at`, `is_alive`, and `message`.
99
-
100
- **Key parameters:**
101
-
102
- | Parameter | Type | Default | Description |
103
- |--------------------|-----------------------------|--------------|-------------------------------------------------------------------------------------------------------|
104
- | `session_type` | `"dynamic"` / `"stealthy"` | required | Type of browser session to create |
105
- | `session_id` | str or null | null | Custom ID for the session. If omitted, a random 12-char hex ID is generated. Raises if already in use |
106
- | `headless` | bool | true | Run browser hidden or visible |
107
- | `max_pages` | int | 5 | Max concurrent browser tabs (1-50) |
108
- | `proxy` | str or dict or null | null | Proxy for all requests in this session |
109
- | `timeout` | number | 30000 | Default timeout in ms |
110
- | `solve_cloudflare` | bool | false | (Stealthy only) Auto-solve Cloudflare challenges |
111
- | `hide_canvas` | bool | false | (Stealthy only) Canvas fingerprint noise |
112
- | `block_webrtc` | bool | false | (Stealthy only) Block WebRTC IP leak |
113
- | `allow_webgl` | bool | true | (Stealthy only) Keep WebGL enabled |
114
-
115
- Plus all other browser session parameters (`google_search`, `real_chrome`, `cdp_url`, `locale`, `timezone_id`, `useragent`, `extra_headers`, `cookies`, `disable_resources`, `network_idle`, `wait_selector`, `wait_selector_state`).
116
-
117
- A dynamic session can only be used with `fetch`/`bulk_fetch`. A stealthy session can only be used with `stealthy_fetch`/`bulk_stealthy_fetch`.
118
-
119
- ### `close_session` -- Close a persistent browser session
120
-
121
- Closes a session and frees its browser resources. Always close sessions when done.
122
-
123
- | Parameter | Type | Default | Description |
124
- |--------------|------|----------|----------------------------------|
125
- | `session_id` | str | required | Session ID from `open_session` |
126
-
127
- Returns a `SessionClosedModel` with `session_id` and `message`.
128
-
129
- ### `list_sessions` -- List active sessions
130
-
131
- Returns a list of `SessionInfo` objects, each with `session_id`, `session_type`, `created_at`, and `is_alive`.
132
-
133
- No parameters.
134
-
135
- ### `screenshot` -- Capture a page screenshot
136
-
137
- Navigates to a URL inside an existing browser session and returns the screenshot as an MCP `ImageContent` block (the bytes the model can see directly, not a base64 string in JSON) followed by a `TextContent` block carrying the post-redirect URL.
138
-
139
- Requires an open browser session. Call `open_session` first, then pass the `session_id` here. Both `dynamic` and `stealthy` sessions are accepted.
140
-
141
- | Parameter | Type | Default | Description |
142
- |-----------------------|-----------------------|--------------|--------------------------------------------------------------------------------------|
143
- | `url` | str | required | URL to navigate to and capture |
144
- | `session_id` | str | required | ID of an open browser session created with `open_session` |
145
- | `image_type` | `"png"` / `"jpeg"` | `"png"` | Image format. Use `"jpeg"` for smaller payloads |
146
- | `full_page` | bool | false | Capture the full scrollable page instead of just the viewport |
147
- | `quality` | int or null | null | JPEG quality 0-100. Raises if passed with `image_type="png"` |
148
- | `wait` | number | 0 | Extra wait (ms) after page load before capture |
149
- | `wait_selector` | str or null | null | CSS selector to wait for before capture |
150
- | `wait_selector_state` | str | `"attached"` | State for `wait_selector`: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
151
- | `network_idle` | bool | false | Wait until no network activity for 500ms |
152
- | `timeout` | number | 30000 | Timeout in milliseconds |
153
-
154
- ## Tool selection guide
155
-
156
- | Scenario | Tool |
157
- |------------------------------------------|---------------------------------------------------------------|
158
- | Static page, no bot protection | `get` |
159
- | Multiple static pages | `bulk_get` |
160
- | JavaScript-rendered / SPA page | `fetch` |
161
- | Multiple JS-rendered pages | `bulk_fetch` |
162
- | Cloudflare or strong anti-bot protection | `stealthy_fetch` (with `solve_cloudflare=true` for Turnstile) |
163
- | Multiple protected pages | `bulk_stealthy_fetch` |
164
- | Multiple pages from the same site | `open_session` + `fetch`/`stealthy_fetch` with `session_id` |
165
- | Need a screenshot of a page | `open_session` + `screenshot` with `session_id` |
166
-
167
- Start with `get` (fastest, lowest resource cost). Escalate to `fetch` if content requires JS rendering. Escalate to `stealthy_fetch` only if blocked. For multiple pages from the same site, use a persistent session to avoid browser launch overhead.
168
-
169
- ## Content extraction tips
170
-
171
- - Use `css_selector` to narrow results before they reach the model -- this saves significant tokens.
172
- - `main_content_only=true` (default) strips nav/footer by restricting to `<body>`.
173
- - `extraction_type="markdown"` (default) is best for readability. Use `"text"` for minimal output, `"html"` when structure matters.
174
- - If a `css_selector` matches multiple elements, all are returned in the `content` list.
175
-
176
- ## Prompt injection protection
177
-
178
- When `main_content_only=true` (the default), the server automatically sanitizes scraped content to prevent prompt injection from malicious websites. It strips:
179
-
180
- - CSS-hidden elements (`display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`, `height:0`, `width:0`)
181
- - `aria-hidden="true"` elements
182
- - `<template>` tags
183
- - HTML comments
184
- - Zero-width unicode characters
185
-
186
- Keep `main_content_only=true` for maximum protection.
187
-
188
- ## Ad blocking
189
-
190
- All browser-based tools (`fetch`, `bulk_fetch`, `stealthy_fetch`, `bulk_stealthy_fetch`) and persistent sessions (`open_session`) automatically block requests to ~3,500 known ad and tracker domains. This is always enabled in the MCP server to save tokens and speed up page loads. No configuration needed.
191
-
192
- ## Setup
193
-
194
- Start the server (stdio transport, used by most MCP clients):
195
-
196
- ```bash
197
- scrapling mcp
198
- ```
199
-
200
- Or with Streamable HTTP transport:
201
-
202
- ```bash
203
- scrapling mcp --http
204
- scrapling mcp --http --host 127.0.0.1 --port 8000
205
- ```
206
-
207
- Docker alternative:
208
-
209
- ```bash
210
- docker pull pyd4vinci/scrapling
211
- docker run -i --rm scrapling mcp
212
- ```
213
-
214
- The MCP server name when registering with a client is `ScraplingServer`. The command is the path to the `scrapling` binary and the argument is `mcp`.
@@ -1,86 +0,0 @@
1
- # Migrating from BeautifulSoup to Scrapling
2
-
3
- API comparison between BeautifulSoup and Scrapling. Scrapling is faster, provides equivalent parsing capabilities, and adds features for fetching and handling modern web pages.
4
-
5
- Some BeautifulSoup shortcuts have no direct Scrapling equivalent. Scrapling avoids those shortcuts to preserve performance.
6
-
7
-
8
- | Task | BeautifulSoup Code | Scrapling Code |
9
- |-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
10
- | Parser import | `from bs4 import BeautifulSoup` | `from scrapling.parser import Selector` |
11
- | Parsing HTML from string | `soup = BeautifulSoup(html, 'html.parser')` | `page = Selector(html)` |
12
- | Finding a single element | `element = soup.find('div', class_='example')` | `element = page.find('div', class_='example')` |
13
- | Finding multiple elements | `elements = soup.find_all('div', class_='example')` | `elements = page.find_all('div', class_='example')` |
14
- | Finding a single element (Example 2) | `element = soup.find('div', attrs={"class": "example"})` | `element = page.find('div', {"class": "example"})` |
15
- | Finding a single element (Example 3) | `element = soup.find(re.compile("^b"))` | `element = page.find(re.compile("^b"))`<br/>`element = page.find_by_regex(r"^b")` |
16
- | Finding a single element (Example 4) | `element = soup.find(lambda e: len(list(e.children)) > 0)` | `element = page.find(lambda e: len(e.children) > 0)` |
17
- | Finding a single element (Example 5) | `element = soup.find(["a", "b"])` | `element = page.find(["a", "b"])` |
18
- | Find element by its text content | `element = soup.find(text="some text")` | `element = page.find_by_text("some text", partial=False)` |
19
- | Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
20
- | Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
21
- | Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
22
- | Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.html_content` |
23
- | Get tag name of an element | `name = element.name` | `name = element.tag` |
24
- | Extracting text content of an element | `string = element.string` | `string = element.text` |
25
- | Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
26
- | Access the dictionary of attributes | `attrs = element.attrs` | `attrs = element.attrib` |
27
- | Extracting attributes | `attr = element['href']` | `attr = element['href']` |
28
- | Navigating to parent | `parent = element.parent` | `parent = element.parent` |
29
- | Get all parents of an element | `parents = list(element.parents)` | `parents = list(element.iterancestors())` |
30
- | Searching for an element in the parents of an element | `target_parent = element.find_parent("a")` | `target_parent = element.find_ancestor(lambda p: p.tag == 'a')` |
31
- | Get all siblings of an element | N/A | `siblings = element.siblings` |
32
- | Get next sibling of an element | `next_element = element.next_sibling` | `next_element = element.next` |
33
- | Searching for an element in the siblings of an element | `target_sibling = element.find_next_sibling("a")`<br/>`target_sibling = element.find_previous_sibling("a")` | `target_sibling = element.siblings.search(lambda s: s.tag == 'a')` |
34
- | Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
35
- | Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
36
- | Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
37
- | Searching for an element in the ancestors of an element | `target_parent = element.find_previous("a")` ¹ | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
38
- | Searching for elements in the ancestors of an element | `target_parent = element.find_all_previous("a")` ¹ | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
39
- | Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
40
- | Navigating to children | `children = list(element.children)` | `children = element.children` |
41
- | Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
42
- | Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
43
-
44
-
45
- ¹ **Note:** BS4's `find_previous`/`find_all_previous` searches all preceding elements in document order, while Scrapling's `path` only returns ancestors (the parent chain). These are not exact equivalents, but ancestor search covers the most common use case.
46
-
47
- BeautifulSoup supports modifying/manipulating the parsed DOM. Scrapling does not - it is read-only and optimized for extraction.
48
-
49
- ### Full Example: Extracting Links
50
-
51
- **With BeautifulSoup:**
52
-
53
- ```python
54
- import requests
55
- from bs4 import BeautifulSoup
56
-
57
- url = 'https://example.com'
58
- response = requests.get(url)
59
- soup = BeautifulSoup(response.text, 'html.parser')
60
-
61
- links = soup.find_all('a')
62
- for link in links:
63
- print(link['href'])
64
- ```
65
-
66
- **With Scrapling:**
67
-
68
- ```python
69
- from scrapling import Fetcher
70
-
71
- url = 'https://example.com'
72
- page = Fetcher.get(url)
73
-
74
- links = page.css('a::attr(href)')
75
- for link in links:
76
- print(link)
77
- ```
78
-
79
- Scrapling combines fetching and parsing into a single step.
80
-
81
- **Note:**
82
-
83
- - **Parsers**: BeautifulSoup supports multiple parser engines. Scrapling always uses `lxml` for performance.
84
- - **Element Types**: BeautifulSoup elements are `Tag` objects; Scrapling elements are `Selector` objects. Both provide similar navigation and extraction methods.
85
- - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.find()`). `page.css()` returns an empty `Selectors` list when no elements match. Use `page.css('.foo').first` to safely get the first match or `None`.
86
- - **Text Extraction**: Scrapling's `TextHandler` provides additional text processing methods such as `clean()` for removing extra whitespace, consecutive spaces, or unwanted characters.