ultimate-pi 0.1.2 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.agents/skills/ck-search/SKILL.md +99 -0
- package/.agents/skills/defuddle/SKILL.md +90 -0
- package/.agents/skills/find-skills/SKILL.md +142 -0
- package/.agents/skills/firecrawl/SKILL.md +150 -0
- package/.agents/skills/firecrawl/rules/install.md +82 -0
- package/.agents/skills/firecrawl/rules/security.md +26 -0
- package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
- package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
- package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
- package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
- package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
- package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
- package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
- package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
- package/.agents/skills/firecrawl-download/SKILL.md +69 -0
- package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
- package/.agents/skills/firecrawl-map/SKILL.md +50 -0
- package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
- package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-search/SKILL.md +59 -0
- package/.agents/skills/obsidian-bases/SKILL.md +299 -0
- package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
- package/.agents/skills/posthog-analyst/SKILL.md +306 -0
- package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
- package/.agents/skills/wiki/SKILL.md +215 -0
- package/.agents/skills/wiki/references/css-snippets.md +122 -0
- package/.agents/skills/wiki/references/frontmatter.md +107 -0
- package/.agents/skills/wiki/references/git-setup.md +58 -0
- package/.agents/skills/wiki/references/mcp-setup.md +149 -0
- package/.agents/skills/wiki/references/modes.md +259 -0
- package/.agents/skills/wiki/references/plugins.md +96 -0
- package/.agents/skills/wiki/references/rest-api.md +124 -0
- package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
- package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
- package/.agents/skills/wiki-fold/SKILL.md +204 -0
- package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
- package/.agents/skills/wiki-ingest/SKILL.md +288 -0
- package/.agents/skills/wiki-lint/SKILL.md +183 -0
- package/.agents/skills/wiki-query/SKILL.md +176 -0
- package/.agents/skills/wiki-save/SKILL.md +128 -0
- package/.ckignore +41 -0
- package/.env.example +9 -0
- package/.github/workflows/lint.yml +33 -0
- package/.github/workflows/publish-github-packages.yml +35 -0
- package/.github/workflows/publish-npm.yml +1 -1
- package/.pi/SYSTEM.md +107 -40
- package/.pi/agents/pi-pi/agent-expert.md +205 -0
- package/.pi/agents/pi-pi/cli-expert.md +47 -0
- package/.pi/agents/pi-pi/config-expert.md +67 -0
- package/.pi/agents/pi-pi/ext-expert.md +53 -0
- package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
- package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
- package/.pi/agents/pi-pi/prompt-expert.md +83 -0
- package/.pi/agents/pi-pi/skill-expert.md +52 -0
- package/.pi/agents/pi-pi/theme-expert.md +46 -0
- package/.pi/agents/pi-pi/tui-expert.md +100 -0
- package/.pi/agents/rethink.md +140 -0
- package/.pi/agents/wiki-ingest.md +67 -0
- package/.pi/agents/wiki-lint.md +75 -0
- package/.pi/auto-commit.json +20 -0
- package/.pi/extensions/banner.png +0 -0
- package/.pi/extensions/ck-enforce.ts +216 -0
- package/.pi/extensions/custom-footer.ts +308 -0
- package/.pi/extensions/custom-header.ts +116 -0
- package/.pi/extensions/dotenv-loader.ts +170 -0
- package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
- package/.pi/model-router.json +95 -0
- package/.pi/npm/.gitignore +2 -0
- package/.pi/prompts/git-sync.md +124 -0
- package/.pi/prompts/harness-setup.md +509 -0
- package/.pi/prompts/save.md +16 -0
- package/.pi/prompts/wiki-autoresearch.md +19 -0
- package/.pi/prompts/wiki.md +23 -0
- package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
- package/.pi/providers/cursor-sdk-provider.ts +1085 -0
- package/.pi/settings.json +14 -4
- package/.pi/skills/agent-router/SKILL.md +174 -0
- package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
- package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
- package/.pi/sounds/error/2-smash-track.mp3 +0 -0
- package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
- package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/project-sounds.json +25 -0
- package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/success/1-tada-track.mp3 +0 -0
- package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
- package/.pi/sounds/success/3-yay-track.mp3 +0 -0
- package/CONTRIBUTING.md +116 -0
- package/README.md +32 -39
- package/biome.json +34 -0
- package/firecrawl/.env.template +58 -0
- package/firecrawl/README.md +49 -0
- package/firecrawl/docker-compose.yaml +201 -0
- package/firecrawl/searxng/searxng.env +3 -0
- package/firecrawl/searxng/settings.yml +85 -0
- package/lefthook.yml +8 -0
- package/package.json +55 -24
- package/vault/AGENTS.md +37 -0
- package/vault/wiki/_templates/comparison.md +39 -0
- package/vault/wiki/_templates/concept.md +40 -0
- package/vault/wiki/_templates/decision.md +21 -0
- package/vault/wiki/_templates/entity.md +32 -0
- package/vault/wiki/_templates/flow.md +14 -0
- package/vault/wiki/_templates/module.md +18 -0
- package/vault/wiki/_templates/question.md +31 -0
- package/vault/wiki/_templates/source.md +39 -0
- package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
- package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
- package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
- package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
- package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
- package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
- package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
- package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
- package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
- package/vault/wiki/concepts/Meta-Harness.md +16 -0
- package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
- package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
- package/vault/wiki/concepts/Prompt Renderer.md +89 -0
- package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
- package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
- package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
- package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
- package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
- package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
- package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
- package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
- package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
- package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
- package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
- package/vault/wiki/concepts/agentic-harness.md +34 -0
- package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
- package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
- package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
- package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
- package/vault/wiki/concepts/ast-compression.md +19 -0
- package/vault/wiki/concepts/ast-truncation.md +66 -0
- package/vault/wiki/concepts/barrel-files.md +37 -0
- package/vault/wiki/concepts/browser-harness-agent.md +41 -0
- package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
- package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
- package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
- package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
- package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
- package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
- package/vault/wiki/concepts/consensus-debate.md +206 -0
- package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
- package/vault/wiki/concepts/context-anxiety.md +57 -0
- package/vault/wiki/concepts/context-compression-techniques.md +19 -0
- package/vault/wiki/concepts/context-continuity.md +22 -0
- package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
- package/vault/wiki/concepts/context-engineering.md +62 -0
- package/vault/wiki/concepts/context-folding.md +67 -0
- package/vault/wiki/concepts/context-mode.md +38 -0
- package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
- package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
- package/vault/wiki/concepts/drift-detection-unified.md +296 -0
- package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
- package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
- package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
- package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
- package/vault/wiki/concepts/fts5-sandbox.md +19 -0
- package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
- package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
- package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
- package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
- package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
- package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
- package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
- package/vault/wiki/concepts/harness-h-formalism.md +53 -0
- package/vault/wiki/concepts/hybrid-code-search.md +61 -0
- package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
- package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
- package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
- package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
- package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
- package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
- package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
- package/vault/wiki/concepts/model-routing-agents.md +101 -0
- package/vault/wiki/concepts/monorepo-architecture.md +45 -0
- package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
- package/vault/wiki/concepts/permission-subsystem.md +16 -0
- package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
- package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
- package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
- package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
- package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
- package/vault/wiki/concepts/provider-native-prompting.md +203 -0
- package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
- package/vault/wiki/concepts/repo-map-ranking.md +42 -0
- package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
- package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
- package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
- package/vault/wiki/concepts/selective-debate-routing.md +70 -0
- package/vault/wiki/concepts/self-evolving-harness.md +60 -0
- package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
- package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
- package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
- package/vault/wiki/concepts/skill-first-architecture.md +166 -0
- package/vault/wiki/concepts/structured-compaction.md +78 -0
- package/vault/wiki/concepts/subagent-orchestration.md +17 -0
- package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
- package/vault/wiki/concepts/superpowers-methodology.md +78 -0
- package/vault/wiki/concepts/think-in-code.md +73 -0
- package/vault/wiki/concepts/ts-execution-layer.md +100 -0
- package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
- package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
- package/vault/wiki/concepts/verification-drift-detection.md +19 -0
- package/vault/wiki/consensus/consensus-records.md +58 -0
- package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
- package/vault/wiki/decisions/adr-008.md +40 -0
- package/vault/wiki/decisions/adr-009.md +46 -0
- package/vault/wiki/decisions/adr-010.md +55 -0
- package/vault/wiki/decisions/adr-011.md +165 -0
- package/vault/wiki/decisions/adr-012.md +102 -0
- package/vault/wiki/decisions/adr-013.md +59 -0
- package/vault/wiki/decisions/adr-014.md +73 -0
- package/vault/wiki/decisions/adr-015.md +81 -0
- package/vault/wiki/decisions/adr-016.md +91 -0
- package/vault/wiki/decisions/adr-017.md +79 -0
- package/vault/wiki/decisions/adr-018.md +100 -0
- package/vault/wiki/decisions/adr-019.md +75 -0
- package/vault/wiki/decisions/adr-020.md +106 -0
- package/vault/wiki/decisions/adr-021.md +86 -0
- package/vault/wiki/decisions/adr-022.md +113 -0
- package/vault/wiki/decisions/adr-023.md +113 -0
- package/vault/wiki/decisions/adr-024.md +73 -0
- package/vault/wiki/decisions/adr-025.md +130 -0
- package/vault/wiki/decisions/adr-026.md +56 -0
- package/vault/wiki/decisions/colocate-wiki.md +34 -0
- package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
- package/vault/wiki/entities/Anthropic.md +17 -0
- package/vault/wiki/entities/Augment Code.md +49 -0
- package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
- package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
- package/vault/wiki/entities/Boris Cherny.md +11 -0
- package/vault/wiki/entities/Claude Code.md +19 -0
- package/vault/wiki/entities/Dennis Ritchie.md +26 -0
- package/vault/wiki/entities/Emergent Labs.md +32 -0
- package/vault/wiki/entities/Google Cloud.md +16 -0
- package/vault/wiki/entities/Guido van Rossum.md +28 -0
- package/vault/wiki/entities/Ken Thompson.md +28 -0
- package/vault/wiki/entities/Lee et al.md +16 -0
- package/vault/wiki/entities/Linus Torvalds.md +28 -0
- package/vault/wiki/entities/Lovable (company).md +40 -0
- package/vault/wiki/entities/Martin Fowler.md +16 -0
- package/vault/wiki/entities/Meng et al.md +16 -0
- package/vault/wiki/entities/OpenAI.md +16 -0
- package/vault/wiki/entities/Rocket.new.md +38 -0
- package/vault/wiki/entities/VILA-Lab.md +15 -0
- package/vault/wiki/entities/autodev-codebase.md +18 -0
- package/vault/wiki/entities/ck-tool.md +59 -0
- package/vault/wiki/entities/codesearch.md +18 -0
- package/vault/wiki/entities/disler-indydevdan.md +33 -0
- package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
- package/vault/wiki/entities/javascript-runtimes.md +48 -0
- package/vault/wiki/entities/jesse-vincent.md +38 -0
- package/vault/wiki/entities/lean-ctx.md +32 -0
- package/vault/wiki/entities/opendev.md +41 -0
- package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
- package/vault/wiki/entities/pi-coding-agent.md +53 -0
- package/vault/wiki/entities/sentrux.md +54 -0
- package/vault/wiki/entities/vgrep-tool.md +57 -0
- package/vault/wiki/entities/vitest.md +41 -0
- package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
- package/vault/wiki/hot.md +932 -0
- package/vault/wiki/index.md +437 -0
- package/vault/wiki/log.md +418 -0
- package/vault/wiki/meta/dashboard.md +30 -0
- package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
- package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
- package/vault/wiki/meta/overview.canvas +43 -0
- package/vault/wiki/modules/adversarial-verification.md +57 -0
- package/vault/wiki/modules/automated-observability.md +54 -0
- package/vault/wiki/modules/bench.md +20 -0
- package/vault/wiki/modules/extensions.md +23 -0
- package/vault/wiki/modules/grounding-checkpoints.md +62 -0
- package/vault/wiki/modules/harness-implementation-plan.md +345 -0
- package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
- package/vault/wiki/modules/harness.md +86 -0
- package/vault/wiki/modules/persistent-memory.md +85 -0
- package/vault/wiki/modules/schema-orchestration.md +68 -0
- package/vault/wiki/modules/skills.md +27 -0
- package/vault/wiki/modules/spec-hardening.md +58 -0
- package/vault/wiki/modules/structured-planning.md +53 -0
- package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
- package/vault/wiki/modules/wiki-query-interface.md +64 -0
- package/vault/wiki/overview.md +51 -0
- package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
- package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
- package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
- package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
- package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
- package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
- package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
- package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
- package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
- package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
- package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
- package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
- package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
- package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
- package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
- package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
- package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
- package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
- package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
- package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
- package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
- package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
- package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
- package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
- package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
- package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
- package/vault/wiki/questions/Research: pi-vcc.md +113 -0
- package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
- package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
- package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
- package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
- package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
- package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
- package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
- package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
- package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
- package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
- package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
- package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
- package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
- package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
- package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
- package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
- package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
- package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
- package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
- package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
- package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
- package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
- package/vault/wiki/sources/Source: Arxiv — Don't Break the Cache.md +85 -0
- package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
- package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
- package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
- package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
- package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
- package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
- package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
- package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
- package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
- package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
- package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering — 0 Lines of Human Code.md +101 -0
- package/vault/wiki/sources/Source: OpenDev — Building AI Coding Agents for the Terminal.md +100 -0
- package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
- package/vault/wiki/sources/Source: Rocket.new — Vibe Solutioning Platform.md +70 -0
- package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
- package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
- package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
- package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
- package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
- package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
- package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
- package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
- package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
- package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
- package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
- package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
- package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
- package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
- package/vault/wiki/sources/ck-semantic-search.md +78 -0
- package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
- package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
- package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
- package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
- package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
- package/vault/wiki/sources/cloudflare-codemode.md +63 -0
- package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
- package/vault/wiki/sources/codeact-apple-2024.md +62 -0
- package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
- package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
- package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
- package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
- package/vault/wiki/sources/context-folding-paper.md +61 -0
- package/vault/wiki/sources/context-mode-website.md +63 -0
- package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
- package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
- package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
- package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
- package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
- package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
- package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
- package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
- package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
- package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
- package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
- package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
- package/vault/wiki/sources/fan2025-imad.md +60 -0
- package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
- package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
- package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
- package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
- package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
- package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
- package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
- package/vault/wiki/sources/gitingest.md +91 -0
- package/vault/wiki/sources/gitreverse.md +63 -0
- package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
- package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
- package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
- package/vault/wiki/sources/gsd-github-repo.md +51 -0
- package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
- package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
- package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
- package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
- package/vault/wiki/sources/langsight-loop-detection.md +80 -0
- package/vault/wiki/sources/leanctx-website.md +69 -0
- package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
- package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
- package/vault/wiki/sources/lou2026-autoharness.md +53 -0
- package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
- package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
- package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
- package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
- package/vault/wiki/sources/ms-chat-history-management.md +13 -0
- package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
- package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
- package/vault/wiki/sources/opencode-dcp.md +13 -0
- package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
- package/vault/wiki/sources/openhands-platform.md +39 -0
- package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
- package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
- package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
- package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
- package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
- package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
- package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
- package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
- package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
- package/vault/wiki/sources/py-tree-sitter.md +13 -0
- package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
- package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
- package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
- package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
- package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
- package/vault/wiki/sources/sentrux-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-release-blog.md +54 -0
- package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
- package/vault/wiki/sources/swe-agent-aci.md +42 -0
- package/vault/wiki/sources/swe-bench.md +45 -0
- package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
- package/vault/wiki/sources/think-in-code-blog.md +48 -0
- package/vault/wiki/sources/tree-sitter-docs.md +13 -0
- package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
- package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
- package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
- package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
- package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
- package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
- package/vault/wiki/sources/unix-philosophy.md +48 -0
- package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
- package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
- package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
- package/vault/wiki/sources/vitest-official.md +41 -0
- package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
- package/vault/wiki/sources/wozcode.md +79 -0
- package/.agents/skills/compress/SKILL.md +0 -111
- package/.agents/skills/compress/scripts/__init__.py +0 -9
- package/.agents/skills/compress/scripts/__main__.py +0 -3
- package/.agents/skills/compress/scripts/benchmark.py +0 -78
- package/.agents/skills/compress/scripts/cli.py +0 -73
- package/.agents/skills/compress/scripts/compress.py +0 -227
- package/.agents/skills/compress/scripts/detect.py +0 -121
- package/.agents/skills/compress/scripts/validate.py +0 -189
- package/.agents/skills/emil-design-eng/SKILL.md +0 -679
- package/.agents/skills/lean-ctx/SKILL.md +0 -149
- package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
- package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
- package/.agents/skills/scrapling-official/SKILL.md +0 -390
- package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
- package/.agents/skills/scrapling-official/examples/README.md +0 -45
- package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
- package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
- package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
- package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
- package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
- package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
- package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
- package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
- package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
- package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
- package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
- package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
- package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
- package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
- package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
- package/PLAN.md +0 -11
- package/extensions/lean-ctx-enforce.ts +0 -166
- package/skills-lock.json +0 -35
- package/wiki/README.md +0 -19
- package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
- package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
- package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
- package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
- package/wiki/decisions/0005-automate-npm-publish-with-github-actions.md +0 -27
- package/wiki/decisions/0006-switch-to-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0007-use-absolute-banner-url-for-npm-readme-rendering.md +0 -26
- package/wiki/decisions/0008-rename-banner-asset-for-cache-busting.md +0 -26
- package/wiki/decisions/0009-force-oidc-path-by-clearing-node-auth-token-in-publish-step.md +0 -25
- package/wiki/decisions/0010-simplify-setup-node-for-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0011-add-noop-workflow-change-to-force-fresh-publish-run.md +0 -25
- package/wiki/decisions/0012-align-workflow-runtime-with-npm-trusted-publishing-requirements.md +0 -26
- package/wiki/decisions/0013-add-package-repository-url-for-provenance-validation.md +0 -25
|
@@ -1,26 +0,0 @@
|
|
|
1
|
-
"""
|
|
2
|
-
Example 1: Python - FetcherSession (persistent HTTP session with Chrome TLS fingerprint)
|
|
3
|
-
|
|
4
|
-
Scrapes all 10 pages of quotes.toscrape.com using a single HTTP session.
|
|
5
|
-
No browser launched - fast and lightweight.
|
|
6
|
-
|
|
7
|
-
Best for: static or semi-static sites, APIs, pages that don't require JavaScript.
|
|
8
|
-
"""
|
|
9
|
-
|
|
10
|
-
from scrapling.fetchers import FetcherSession
|
|
11
|
-
|
|
12
|
-
all_quotes = []
|
|
13
|
-
|
|
14
|
-
with FetcherSession(impersonate="chrome") as session:
|
|
15
|
-
for i in range(1, 11):
|
|
16
|
-
page = session.get(
|
|
17
|
-
f"https://quotes.toscrape.com/page/{i}/",
|
|
18
|
-
stealthy_headers=True,
|
|
19
|
-
)
|
|
20
|
-
quotes = page.css(".quote .text::text").getall()
|
|
21
|
-
all_quotes.extend(quotes)
|
|
22
|
-
print(f"Page {i}: {len(quotes)} quotes (status {page.status})")
|
|
23
|
-
|
|
24
|
-
print(f"\nTotal: {len(all_quotes)} quotes\n")
|
|
25
|
-
for i, quote in enumerate(all_quotes, 1):
|
|
26
|
-
print(f"{i:>3}. {quote}")
|
|
@@ -1,26 +0,0 @@
|
|
|
1
|
-
"""
|
|
2
|
-
Example 2: Python - DynamicSession (Playwright browser automation, visible)
|
|
3
|
-
|
|
4
|
-
Scrapes all 10 pages of quotes.toscrape.com using a persistent browser session.
|
|
5
|
-
The browser window stays open across all page requests for efficiency.
|
|
6
|
-
|
|
7
|
-
Best for: JavaScript-heavy pages, SPAs, sites with dynamic content loading.
|
|
8
|
-
|
|
9
|
-
Set headless=True to run the browser hidden.
|
|
10
|
-
Set disable_resources=True to skip loading images/fonts for a speed boost.
|
|
11
|
-
"""
|
|
12
|
-
|
|
13
|
-
from scrapling.fetchers import DynamicSession
|
|
14
|
-
|
|
15
|
-
all_quotes = []
|
|
16
|
-
|
|
17
|
-
with DynamicSession(headless=False, disable_resources=True) as session:
|
|
18
|
-
for i in range(1, 11):
|
|
19
|
-
page = session.fetch(f"https://quotes.toscrape.com/page/{i}/")
|
|
20
|
-
quotes = page.css(".quote .text::text").getall()
|
|
21
|
-
all_quotes.extend(quotes)
|
|
22
|
-
print(f"Page {i}: {len(quotes)} quotes (status {page.status})")
|
|
23
|
-
|
|
24
|
-
print(f"\nTotal: {len(all_quotes)} quotes\n")
|
|
25
|
-
for i, quote in enumerate(all_quotes, 1):
|
|
26
|
-
print(f"{i:>3}. {quote}")
|
|
@@ -1,26 +0,0 @@
|
|
|
1
|
-
"""
|
|
2
|
-
Example 3: Python - StealthySession (Patchright stealth browser, visible)
|
|
3
|
-
|
|
4
|
-
Scrapes all 10 pages of quotes.toscrape.com using a persistent stealth browser session.
|
|
5
|
-
Bypasses anti-bot protections automatically (Cloudflare Turnstile, fingerprinting, etc.).
|
|
6
|
-
|
|
7
|
-
Best for: well-protected sites, Cloudflare-gated pages, sites that detect Playwright.
|
|
8
|
-
|
|
9
|
-
Set headless=True to run the browser hidden.
|
|
10
|
-
Add solve_cloudflare=True to auto-solve Cloudflare challenges.
|
|
11
|
-
"""
|
|
12
|
-
|
|
13
|
-
from scrapling.fetchers import StealthySession
|
|
14
|
-
|
|
15
|
-
all_quotes = []
|
|
16
|
-
|
|
17
|
-
with StealthySession(headless=False) as session:
|
|
18
|
-
for i in range(1, 11):
|
|
19
|
-
page = session.fetch(f"https://quotes.toscrape.com/page/{i}/")
|
|
20
|
-
quotes = page.css(".quote .text::text").getall()
|
|
21
|
-
all_quotes.extend(quotes)
|
|
22
|
-
print(f"Page {i}: {len(quotes)} quotes (status {page.status})")
|
|
23
|
-
|
|
24
|
-
print(f"\nTotal: {len(all_quotes)} quotes\n")
|
|
25
|
-
for i, quote in enumerate(all_quotes, 1):
|
|
26
|
-
print(f"{i:>3}. {quote}")
|
|
@@ -1,58 +0,0 @@
|
|
|
1
|
-
"""
|
|
2
|
-
Example 4: Python - Spider (auto-crawling framework)
|
|
3
|
-
|
|
4
|
-
Scrapes ALL pages of quotes.toscrape.com by following "Next" pagination links
|
|
5
|
-
automatically. No manual page looping needed.
|
|
6
|
-
|
|
7
|
-
The spider yields structured items (text + author + tags) and exports them to JSON.
|
|
8
|
-
|
|
9
|
-
Best for: multi-page crawls, full-site scraping, anything needing pagination or
|
|
10
|
-
link following across many pages.
|
|
11
|
-
|
|
12
|
-
Outputs:
|
|
13
|
-
- Live stats to terminal during crawl
|
|
14
|
-
- Final crawl stats at the end
|
|
15
|
-
- quotes.json in the current directory
|
|
16
|
-
"""
|
|
17
|
-
|
|
18
|
-
from scrapling.spiders import Spider, Response
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
class QuotesSpider(Spider):
|
|
22
|
-
name = "quotes"
|
|
23
|
-
start_urls = ["https://quotes.toscrape.com/"]
|
|
24
|
-
concurrent_requests = 5 # Fetch up to 5 pages at once
|
|
25
|
-
|
|
26
|
-
async def parse(self, response: Response):
|
|
27
|
-
# Extract all quotes on the current page
|
|
28
|
-
for quote in response.css(".quote"):
|
|
29
|
-
yield {
|
|
30
|
-
"text": quote.css(".text::text").get(),
|
|
31
|
-
"author": quote.css(".author::text").get(),
|
|
32
|
-
"tags": quote.css(".tags .tag::text").getall(),
|
|
33
|
-
}
|
|
34
|
-
|
|
35
|
-
# Follow the "Next" button to the next page (if it exists)
|
|
36
|
-
next_page = response.css(".next a")
|
|
37
|
-
if next_page:
|
|
38
|
-
yield response.follow(next_page[0].attrib["href"])
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
if __name__ == "__main__":
|
|
42
|
-
result = QuotesSpider().start()
|
|
43
|
-
|
|
44
|
-
print(f"\n{'=' * 50}")
|
|
45
|
-
print(f"Scraped : {result.stats.items_scraped} quotes")
|
|
46
|
-
print(f"Requests: {result.stats.requests_count}")
|
|
47
|
-
print(f"Time : {result.stats.elapsed_seconds:.2f}s")
|
|
48
|
-
print(f"Speed : {result.stats.requests_per_second:.2f} req/s")
|
|
49
|
-
print(f"{'=' * 50}\n")
|
|
50
|
-
|
|
51
|
-
for i, item in enumerate(result.items, 1):
|
|
52
|
-
print(f"{i:>3}. [{item['author']}] {item['text']}")
|
|
53
|
-
if item["tags"]:
|
|
54
|
-
print(f" Tags: {', '.join(item['tags'])}")
|
|
55
|
-
|
|
56
|
-
# Export to JSON
|
|
57
|
-
result.items.to_json("quotes.json", indent=True)
|
|
58
|
-
print("\nExported to quotes.json")
|
|
@@ -1,45 +0,0 @@
|
|
|
1
|
-
# Scrapling Examples
|
|
2
|
-
|
|
3
|
-
These examples scrape [quotes.toscrape.com](https://quotes.toscrape.com) - a safe, purpose-built scraping sandbox - and demonstrate every tool available in Scrapling, from plain HTTP to full browser automation and spiders.
|
|
4
|
-
|
|
5
|
-
All examples collect **all 100 quotes across 10 pages**.
|
|
6
|
-
|
|
7
|
-
## Quick Start
|
|
8
|
-
|
|
9
|
-
Make sure Scrapling is installed:
|
|
10
|
-
|
|
11
|
-
```bash
|
|
12
|
-
pip install "scrapling[all]>=0.4.7"
|
|
13
|
-
scrapling install --force
|
|
14
|
-
```
|
|
15
|
-
|
|
16
|
-
## Examples
|
|
17
|
-
|
|
18
|
-
| File | Tool | Type | Best For |
|
|
19
|
-
|--------------------------|-------------------|-----------------------------|---------------------------------------|
|
|
20
|
-
| `01_fetcher_session.py` | `FetcherSession` | Python - persistent HTTP | APIs, fast multi-page scraping |
|
|
21
|
-
| `02_dynamic_session.py` | `DynamicSession` | Python - browser automation | Dynamic/SPA pages |
|
|
22
|
-
| `03_stealthy_session.py` | `StealthySession` | Python - stealth browser | Cloudflare, fingerprint bypass |
|
|
23
|
-
| `04_spider.py` | `Spider` | Python - auto-crawling | Multi-page crawls, full-site scraping |
|
|
24
|
-
|
|
25
|
-
## Running
|
|
26
|
-
|
|
27
|
-
**Python scripts:**
|
|
28
|
-
|
|
29
|
-
```bash
|
|
30
|
-
python examples/01_fetcher_session.py
|
|
31
|
-
python examples/02_dynamic_session.py # Opens a visible browser
|
|
32
|
-
python examples/03_stealthy_session.py # Opens a visible stealth browser
|
|
33
|
-
python examples/04_spider.py # Auto-crawls all pages, exports quotes.json
|
|
34
|
-
```
|
|
35
|
-
|
|
36
|
-
## Escalation Guide
|
|
37
|
-
|
|
38
|
-
Start with the fastest, lightest option and escalate only if needed:
|
|
39
|
-
|
|
40
|
-
```
|
|
41
|
-
get / FetcherSession
|
|
42
|
-
└─ If JS required → fetch / DynamicSession
|
|
43
|
-
└─ If blocked → stealthy-fetch / StealthySession
|
|
44
|
-
└─ If multi-page → Spider
|
|
45
|
-
```
|
|
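One way to read the escalation guide in the removed README is as a small decision function. The sketch below is illustrative only: `pick_tool` and its three flags are hypothetical names that do not come from Scrapling, and the function simply returns the class names from the examples table as plain strings.

```python
def pick_tool(needs_js: bool = False, blocked: bool = False, multi_page: bool = False) -> str:
    """Return the lightest tool name that covers the stated needs (hypothetical helper)."""
    if multi_page:
        return "Spider"            # crawling with link following
    if blocked:
        return "StealthySession"   # anti-bot protection in the way
    if needs_js:
        return "DynamicSession"    # page needs a real browser to render
    return "FetcherSession"        # plain HTTP is enough


choice = pick_tool(needs_js=True)
```

The checks run from the heaviest requirement to the lightest, so the default case falls through to the plain HTTP session.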
@@ -1,78 +0,0 @@
|
|
|
1
|
-
# Fetchers basics
|
|
2
|
-
|
|
3
|
-
## Introduction
|
|
4
|
-
Fetchers are classes that make requests or fetch pages in a single call, with many built-in features, and return a [Response](#response-object) object. Every fetcher has a separate session class that keeps the session running (e.g., a browser fetcher keeps the browser open until you finish all requests).
|
|
5
|
-
|
|
6
|
-
Fetchers are not wrappers built on top of other libraries. They use these libraries as an engine to request/fetch pages but add features the underlying engines don't have, while still fully leveraging and optimizing them for web scraping.
|
|
7
|
-
|
|
8
|
-
## Fetchers Overview
|
|
9
|
-
|
|
10
|
-
Scrapling provides three different fetcher classes with their session classes; each fetcher is designed for a specific use case.
|
|
11
|
-
|
|
12
|
-
The following table compares them for quick guidance.
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
| Feature | Fetcher | DynamicFetcher | StealthyFetcher |
|
|
16
|
-
|--------------------|---------------------------------------------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
|
|
17
|
-
| Relative speed | 🐇🐇🐇🐇🐇 | 🐇🐇🐇 | 🐇🐇🐇 |
|
|
18
|
-
| Stealth | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
|
|
19
|
-
| Anti-Bot options | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
|
|
20
|
-
| JavaScript loading | ❌ | ✅ | ✅ |
|
|
21
|
-
| Memory Usage | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
|
|
22
|
-
| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites <br/>- Small automation<br/>- Small-Mid protections | - Dynamically loaded websites <br/>- Small automation <br/>- Small-Complicated protections |
|
|
23
|
-
| Browser(s) | ❌ | Chromium and Google Chrome | Chromium and Google Chrome |
|
|
24
|
-
| Browser API used | ❌ | PlayWright | PlayWright |
|
|
25
|
-
| Setup Complexity | Simple | Simple | Simple |
|
|
26
|
-
|
|
27
|
-
## Parser configuration in all fetchers
|
|
28
|
-
All fetchers are imported the same way, as you will see in the upcoming pages:
|
|
29
|
-
```python
|
|
30
|
-
>>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
|
|
31
|
-
```
|
|
32
|
-
You can then use it right away without instantiating it, and it will use the default parser settings:
|
|
33
|
-
```python
|
|
34
|
-
>>> page = StealthyFetcher.fetch('https://example.com')
|
|
35
|
-
```
|
|
36
|
-
To configure the parser ([Selector class](parsing/main_classes.md#selector)) that is applied to the response before it is returned to you, do this first:
|
|
37
|
-
```python
|
|
38
|
-
>>> from scrapling.fetchers import Fetcher
|
|
39
|
-
>>> Fetcher.configure(adaptive=True, keep_comments=False, keep_cdata=False) # and the rest
|
|
40
|
-
```
|
|
41
|
-
or
|
|
42
|
-
```python
|
|
43
|
-
>>> from scrapling.fetchers import Fetcher
|
|
44
|
-
>>> Fetcher.adaptive=True
|
|
45
|
-
>>> Fetcher.keep_comments=False
|
|
46
|
-
>>> Fetcher.keep_cdata=False # and the rest
|
|
47
|
-
```
|
|
48
|
-
Then, continue your code as usual.
|
|
49
|
-
|
|
50
|
-
The available configuration arguments are: `adaptive`, `adaptive_domain`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
|
|
51
|
-
|
|
52
|
-
**Info:** The `adaptive` argument is disabled by default; you must enable it to use that feature.
|
|
53
|
-
|
|
54
|
-
### Set parser config per request
|
|
55
|
-
The configuration above applies globally to all requests/fetches made through that class; it is intended for simplicity.
|
|
56
|
-
|
|
57
|
-
If your use case requires a different configuration for each request/fetch, you can pass a dictionary to the `selector_config` argument of the request method (`fetch`/`get`/`post`/...).
|
|
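The relationship between the class-level parser settings and a per-request `selector_config` can be pictured as a dictionary merge. This is only a sketch of that idea, assuming per-request keys take precedence; `GLOBAL_PARSER_CONFIG` and `effective_config` are hypothetical names, not Scrapling API, and the values other than `adaptive` are placeholders.

```python
# Hypothetical class-level parser settings. The values are placeholders,
# except that `adaptive` is documented as disabled by default.
GLOBAL_PARSER_CONFIG = {"adaptive": False, "keep_comments": False, "keep_cdata": False}


def effective_config(selector_config=None):
    """Return the settings for one request (assumption: per-request keys win)."""
    merged = dict(GLOBAL_PARSER_CONFIG)   # copy so the global dict stays untouched
    merged.update(selector_config or {})  # overlay the per-request dictionary
    return merged


per_request = effective_config({"adaptive": True})
```

Only the keys present in the per-request dictionary change; everything else keeps the global value.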
58
|
-
|
|
59
|
-
## Response Object
|
|
60
|
-
The `Response` object is the same as the [Selector](parsing/main_classes.md#selector) class, but it has additional details about the response, like response headers, status, cookies, etc., as shown below:
|
|
61
|
-
```python
|
|
62
|
-
>>> from scrapling.fetchers import Fetcher
|
|
63
|
-
>>> page = Fetcher.get('https://example.com')
|
|
64
|
-
|
|
65
|
-
>>> page.status # HTTP status code
|
|
66
|
-
>>> page.reason # Status message
|
|
67
|
-
>>> page.cookies # Response cookies as a dictionary
|
|
68
|
-
>>> page.headers # Response headers
|
|
69
|
-
>>> page.request_headers # Request headers
|
|
70
|
-
>>> page.history # Response history of redirections, if any
|
|
71
|
-
>>> page.body # Raw response body as bytes
|
|
72
|
-
>>> page.encoding # Response encoding
|
|
73
|
-
>>> page.meta # Response metadata dictionary (e.g., proxy used). Mainly helpful with the spiders system.
|
|
74
|
-
>>> page.captured_xhr # List of captured XHR/fetch responses (when capture_xhr is enabled on a browser session)
|
|
75
|
-
```
|
|
76
|
-
All fetchers return the `Response` object.
|
|
77
|
-
|
|
78
|
-
**Note:** Unlike the [Selector](parsing/main_classes.md#selector) class, the `Response` class's body is always bytes since v0.4.
|
|
@@ -1,352 +0,0 @@
|
|
|
1
|
-
# Fetching dynamic websites
|
|
2
|
-
|
|
3
|
-
`DynamicFetcher` (formerly `PlayWrightFetcher`) provides flexible browser automation with multiple configuration options and built-in stealth improvements.
|
|
4
|
-
|
|
5
|
-
As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
|
|
6
|
-
|
|
7
|
-
## Basic Usage
|
|
8
|
-
You have one primary way to import this Fetcher, which is the same for all fetchers.
|
|
9
|
-
|
|
10
|
-
```python
|
|
11
|
-
>>> from scrapling.fetchers import DynamicFetcher
|
|
12
|
-
```
|
|
13
|
-
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).
|
|
14
|
-
|
|
15
|
-
**Note:** The async version of the `fetch` method is `async_fetch`.
|
|
16
|
-
|
|
17
|
-
This fetcher provides three main run options that can be combined as desired.
|
|
18
|
-
|
|
19
|
-
They are:
|
|
20
|
-
|
|
21
|
-
### 1. Vanilla Playwright
|
|
22
|
-
```python
|
|
23
|
-
DynamicFetcher.fetch('https://example.com')
|
|
24
|
-
```
|
|
25
|
-
Used this way, it opens a Chromium browser and loads the page. Some speed optimizations and stealth measures are applied automatically under the hood, but otherwise there are no tricks or extra features unless you enable them; it's just the plain Playwright API.
|
|
26
|
-
|
|
27
|
-
### 2. Real Chrome
|
|
28
|
-
```python
|
|
29
|
-
DynamicFetcher.fetch('https://example.com', real_chrome=True)
|
|
30
|
-
```
|
|
31
|
-
If you have Google Chrome installed, use this option. It works like the first option but uses the Google Chrome browser installed on your device instead of Chromium. This makes your requests look more authentic and therefore harder to detect.
|
|
32
|
-
|
|
33
|
-
If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
|
|
34
|
-
```commandline
|
|
35
|
-
playwright install chrome
|
|
36
|
-
```
|
|
37
|
-
|
|
38
|
-
### 3. CDP Connection
|
|
39
|
-
```python
|
|
40
|
-
DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
|
|
41
|
-
```
|
|
42
|
-
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
**Notes:**
|
|
46
|
-
* There was a `stealth` option here, but it was moved to the `StealthyFetcher` class, as explained on the next page, with additional features since version 0.3.13.
|
|
47
|
-
* This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](stealthy.md).
|
|
48
|
-
|
|
49
|
-
## Full list of arguments
|
|
50
|
-
All arguments for `DynamicFetcher` and its session classes:
|
|
51
|
-
|
|
52
|
-
| Argument | Description | Optional |
|
|
53
|
-
|:-------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
|
|
54
|
-
| url | Target url | ❌ |
|
|
55
|
-
| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
|
|
56
|
-
| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
|
|
57
|
-
| cookies | Set cookies for the next request. | ✔️ |
|
|
58
|
-
| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and version.** | ✔️ |
|
|
59
|
-
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
|
60
|
-
| load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
|
|
61
|
-
| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
|
|
62
|
-
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
|
|
63
|
-
| page_action | Added for automation. Pass a function that takes the `page` object, runs after navigation, and does the necessary automation. | ✔️ |
|
|
64
|
-
| page_setup | A function that takes the `page` object, runs before navigation. Use it to register event listeners or routes that must be set up before the page loads. | ✔️ |
|
|
65
|
-
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
|
66
|
-
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
|
67
|
-
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
|
68
|
-
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
|
|
69
|
-
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
|
|
70
|
-
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
|
71
|
-
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
|
72
|
-
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
73
|
-
| timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
|
|
74
|
-
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
|
|
75
|
-
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
|
|
76
|
-
| extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
|
|
77
|
-
| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
|
|
78
|
-
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
|
|
79
|
-
| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
|
|
80
|
-
| block_ads | Block requests to ~3,500 known ad/tracking domains. Can be combined with `blocked_domains`. | ✔️ |
|
|
81
|
-
| dns_over_https | Route DNS queries through Cloudflare's DNS-over-HTTPS to prevent DNS leaks when using proxies. | ✔️ |
|
|
82
|
-
| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
|
|
83
|
-
| retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
|
|
84
|
-
| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
|
|
85
|
-
| capture_xhr | Pass a regex URL pattern string to capture XHR/fetch requests matching it during page load. Captured responses are available via `response.captured_xhr`. Defaults to `None` (disabled). | ✔️ |
|
|
86
|
-
| executable_path | Absolute path to a custom browser executable to use instead of the bundled Chromium. Useful for non-standard installations or custom browser builds. | ✔️ |
|
|
87
|
-
|
|
88
|
-
In session classes, all these arguments can be set globally for the session. You can still configure each request individually by passing any of the arguments that apply at the browser-tab level: `google_search`, `timeout`, `wait`, `page_action`, `page_setup`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
|
|
89
|
-
|
|
90
|
-
**Notes:**
|
|
91
|
-
1. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
|
92
|
-
2. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
|
|
93
|
-
3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
|
|
94
|
-
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
|
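Note 2 above describes a precedence rule between `google_search` and `extra_headers`. A minimal sketch of that rule, assuming a case-insensitive header match, might look like the following; `build_headers` is a hypothetical helper written for illustration and is not how Scrapling necessarily implements it.

```python
GOOGLE_REFERER = "https://www.google.com/"  # referer value quoted in the note


def build_headers(extra_headers=None, google_search=True):
    """Merge caller headers with the Google referer (hypothetical helper)."""
    headers = dict(extra_headers or {})
    if google_search:
        # Drop any caller-supplied referer, whatever its capitalization,
        # so the Google referer takes priority.
        headers = {k: v for k, v in headers.items() if k.lower() != "referer"}
        headers["referer"] = GOOGLE_REFERER
    return headers


merged = build_headers({"Referer": "https://example.com/", "X-Test": "1"})
```

With `google_search=False`, the caller's own referer passes through unchanged.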
95
|
-
|
|
96
|
-
|
|
97
|
-
## Examples
|
|
98
|
-
|
|
99
|
-
### Resource Control
|
|
100
|
-
|
|
101
|
-
```python
|
|
102
|
-
# Disable unnecessary resources
|
|
103
|
-
page = DynamicFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
|
|
104
|
-
```
|
|
105
|
-
|
|
106
|
-
### Domain Blocking
|
|
107
|
-
|
|
108
|
-
```python
|
|
109
|
-
# Block requests to specific domains (and their subdomains)
|
|
110
|
-
page = DynamicFetcher.fetch('https://example.com', blocked_domains={"ads.example.com", "tracker.net"})
|
|
111
|
-
```
|
|
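The subdomain matching described for `blocked_domains` can be illustrated with a suffix check on the hostname. This is a sketch of the stated behaviour only; `is_blocked` is a hypothetical function, and the library's real matching logic may differ in details such as case handling.

```python
def is_blocked(hostname: str, blocked_domains: set) -> bool:
    """True if hostname equals a blocked domain or is one of its subdomains."""
    host = hostname.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


blocked = {"example.com", "tracker.net"}
result = is_blocked("sub.example.com", blocked)
```

Matching on `"." + domain` rather than on the bare suffix keeps unrelated hosts such as `notexample.com` from being blocked by accident.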
112
|
-
|
|
113
|
-
### Network Control
|
|
114
|
-
|
|
115
|
-
```python
|
|
116
|
-
# Wait for network idle (Consider fetch to be finished when there are no network connections for at least 500 ms)
|
|
117
|
-
page = DynamicFetcher.fetch('https://example.com', network_idle=True)
|
|
118
|
-
|
|
119
|
-
# Custom timeout (in milliseconds)
|
|
120
|
-
page = DynamicFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
|
|
121
|
-
|
|
122
|
-
# Proxy support (It can also be a dictionary with only the keys 'server', 'username', and 'password'.)
|
|
123
|
-
page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
|
|
124
|
-
```
|
|
125
|
-
|
|
126
|
-
### Proxy Rotation
|
|
127
|
-
|
|
128
|
-
```python
|
|
129
|
-
from scrapling.fetchers import DynamicSession, ProxyRotator
|
|
130
|
-
|
|
131
|
-
# Set up proxy rotation
|
|
132
|
-
rotator = ProxyRotator([
|
|
133
|
-
"http://proxy1:8080",
|
|
134
|
-
"http://proxy2:8080",
|
|
135
|
-
"http://proxy3:8080",
|
|
136
|
-
])
|
|
137
|
-
|
|
138
|
-
# Use with session - rotates proxy automatically with each request
|
|
139
|
-
with DynamicSession(proxy_rotator=rotator, headless=True) as session:
|
|
140
|
-
page1 = session.fetch('https://example1.com')
|
|
141
|
-
page2 = session.fetch('https://example2.com')
|
|
142
|
-
|
|
143
|
-
# Override rotator for a specific request
|
|
144
|
-
page3 = session.fetch('https://example3.com', proxy='http://specific-proxy:8080')
|
|
145
|
-
```
|
|
146
|
-
|
|
147
|
-
**Warning:** By default, all browser-based fetchers and sessions use a persistent browser context with a pool of tabs. However, since browsers can't set a proxy per tab, when you use a `ProxyRotator`, the fetcher will automatically open a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed.
|
|
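To make the rotation and per-request override above concrete, here is a small stand-in written with the standard library. It is not Scrapling's `ProxyRotator`: the class `RoundRobinRotator`, the function `proxy_for_request`, and the round-robin strategy itself are all assumptions made for illustration.

```python
from itertools import cycle


class RoundRobinRotator:
    """Hypothetical stand-in that hands out proxies in a fixed cycle."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        return next(self._pool)


def proxy_for_request(rotator, override=None):
    """Use the explicit proxy if one is given, otherwise take the next from the rotator."""
    return override if override is not None else rotator.next_proxy()


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
first_three = [proxy_for_request(rotator) for _ in range(3)]
```

An override does not advance the cycle in this sketch, so the next rotated request continues where the pool left off.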
148
|
-
|
|
149
|
-
### Downloading Files
|
|
150
|
-
|
|
151
|
-
```python
|
|
152
|
-
page = DynamicFetcher.fetch('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png')
|
|
153
|
-
|
|
154
|
-
with open(file='main_cover.png', mode='wb') as f:
|
|
155
|
-
f.write(page.body)
|
|
156
|
-
```
|
|
157
|
-
|
|
158
|
-
The `body` attribute of the `Response` object always returns `bytes`.
|
|
159
|
-
|
|
160
|
-
### Pre-Navigation Setup
|
|
161
|
-
If you need to set up event listeners, routes, or scripts that must be registered before the page navigates, use `page_setup`. This function receives the `page` object and runs before `page.goto()` is called.
|
|
162
|
-
|
|
163
|
-
```python
|
|
164
|
-
from playwright.sync_api import Page
|
|
165
|
-
|
|
166
|
-
def capture_websockets(page: Page):
|
|
167
|
-
page.on("websocket", lambda ws: print(f"WebSocket opened: {ws.url}"))
|
|
168
|
-
|
|
169
|
-
page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)
|
|
170
|
-
```
|
|
171
|
-
Async version:
|
|
172
|
-
```python
|
|
173
|
-
from playwright.async_api import Page
|
|
174
|
-
|
|
175
|
-
async def capture_websockets(page: Page):
|
|
176
|
-
page.on("websocket", lambda ws: print(f"WebSocket opened: {ws.url}"))
|
|
177
|
-
|
|
178
|
-
page = await DynamicFetcher.async_fetch('https://example.com', page_setup=capture_websockets)
|
|
179
|
-
```
|
|
180
|
-
|
|
181
|
-
You can combine it with `page_action` -- `page_setup` runs before navigation, `page_action` runs after.
|
|
182
|
-
|
|
183
|
-
### Browser Automation
|
|
184
|
-
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
|
|
185
|
-
|
|
186
|
-
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.
|
|
187
|
-
|
|
188
|
-
In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
|
|
189
|
-
```python
|
|
190
|
-
from playwright.sync_api import Page
|
|
191
|
-
|
|
192
|
-
def scroll_page(page: Page):
|
|
193
|
-
page.mouse.wheel(10, 0)
|
|
194
|
-
page.mouse.move(100, 400)
|
|
195
|
-
page.mouse.up()
|
|
196
|
-
|
|
197
|
-
page = DynamicFetcher.fetch('https://example.com', page_action=scroll_page)
|
|
198
|
-
```
|
|
199
|
-
Of course, if you use the async fetch version, the function must also be async.
|
|
200
|
-
```python
|
|
201
|
-
from playwright.async_api import Page
|
|
202
|
-
|
|
203
|
-
async def scroll_page(page: Page):
|
|
204
|
-
await page.mouse.wheel(10, 0)
|
|
205
|
-
await page.mouse.move(100, 400)
|
|
206
|
-
await page.mouse.up()
|
|
207
|
-
|
|
208
|
-
page = await DynamicFetcher.async_fetch('https://example.com', page_action=scroll_page)
|
|
209
|
-
```
|
|
210
|
-
|
|
211
|
-
### Wait Conditions
|
|
212
|
-
|
|
213
|
-
```python
|
|
214
|
-
# Wait for the selector
|
|
215
|
-
page = DynamicFetcher.fetch(
|
|
216
|
-
'https://example.com',
|
|
217
|
-
wait_selector='h1',
|
|
218
|
-
wait_selector_state='visible'
|
|
219
|
-
)
|
|
220
|
-
```
|
|
221
|
-
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
|
|
222
|
-
|
|
223
|
-
After that, if `load_dom` is enabled (the default), the fetcher will check again to see if all JavaScript files are loaded and executed (in the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
|
|
224
|
-
|
|
225
|
-
The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
|
|
226
|
-
|
|
227
|
-
- `attached`: Wait for an element to be present in the DOM.
|
|
228
|
-
- `detached`: Wait for an element to not be present in the DOM.
|
|
229
|
-
- `visible`: Wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
|
|
230
|
-
- `hidden`: Wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is the opposite of the `'visible'` option.
|
|
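The four states above can be summarized as simple predicates. The following is a toy model only, not Playwright's or Scrapling's implementation; `ElementSnapshot` and `state_satisfied` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ElementSnapshot:
    """Hypothetical stand-in for what the browser knows about one element."""
    attached: bool           # present in the DOM
    has_bounding_box: bool   # False for display:none or an element with no content
    visibility_hidden: bool  # computed style is visibility:hidden


def state_satisfied(state: str, el: ElementSnapshot) -> bool:
    """Return True if the element currently fulfils the given wait state."""
    visible = el.attached and el.has_bounding_box and not el.visibility_hidden
    if state == "attached":
        return el.attached
    if state == "detached":
        return not el.attached
    if state == "visible":
        return visible
    if state == "hidden":
        # Detached, empty bounding box, or visibility:hidden
        return not visible
    raise ValueError(f"unknown state: {state!r}")


# A loading spinner that is still in the DOM but rendered with display:none
spinner = ElementSnapshot(attached=True, has_bounding_box=False, visibility_hidden=False)
print(state_satisfied("attached", spinner), state_satisfied("hidden", spinner))
```

In this model, an element can be `attached` and `hidden` at the same time, which is why waiting for `hidden` is a common way to wait for a spinner to disappear.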
231
|
-
|
|
232
|
-
### Capturing XHR/Fetch Requests
|
|
233
|
-
|
|
234
|
-
Many SPAs load data through background API calls (XHR/fetch). You can capture these requests by passing a regex URL pattern to `capture_xhr` at the session level:
|
|
235
|
-
|
|
236
|
-
```python
|
|
237
|
-
from scrapling.fetchers import DynamicSession
|
|
238
|
-
|
|
239
|
-
with DynamicSession(capture_xhr=r"https://api\.example\.com/.*", headless=True) as session:
|
|
240
|
-
page = session.fetch('https://example.com')
|
|
241
|
-
|
|
242
|
-
# Access captured XHR responses
|
|
243
|
-
for xhr in page.captured_xhr:
|
|
244
|
-
print(xhr.url, xhr.status)
|
|
245
|
-
print(xhr.body) # Raw response body as bytes
|
|
246
|
-
```
|
|
247
|
-
|
|
248
|
-
Each item in `captured_xhr` is a full `Response` object with the same properties (`.url`, `.status`, `.headers`, `.body`, etc.). When `capture_xhr` is not set or is `None`, `captured_xhr` is an empty list.
|
|
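Because `.body` holds raw bytes, a common next step is decoding JSON payloads from the captured responses. The sketch below assumes only the `.url`, `.status`, and `.body` attributes mentioned above; `FakeXHR` and `json_payloads` are hypothetical names used so the example runs without a browser.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeXHR:
    """Hypothetical stand-in exposing the attributes used above."""
    url: str
    status: int
    body: bytes


def json_payloads(captured):
    """Return {url: parsed JSON} for 2xx responses whose body is valid JSON."""
    payloads = {}
    for xhr in captured:
        if not 200 <= xhr.status < 300:
            continue  # skip error responses
        try:
            payloads[xhr.url] = json.loads(xhr.body)  # json.loads accepts bytes
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # body is not JSON; ignore it
    return payloads


captured = [
    FakeXHR("https://api.example.com/items", 200, b'{"items": [1, 2]}'),
    FakeXHR("https://api.example.com/ping", 200, b"pong"),
    FakeXHR("https://api.example.com/missing", 404, b'{"error": "nope"}'),
]
result = json_payloads(captured)
print(result)
```

With a real session, the same helper could presumably be called as `json_payloads(page.captured_xhr)`.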
249
|
-
|
|
250
|
-
### Some Stealth Features
|
|
251
|
-
|
|
252
|
-
```python
|
|
253
|
-
page = DynamicFetcher.fetch(
|
|
254
|
-
'https://example.com',
|
|
255
|
-
google_search=True,
|
|
256
|
-
useragent='Mozilla/5.0...', # Custom user agent
|
|
257
|
-
locale='en-US', # Set browser locale
|
|
258
|
-
)
|
|
259
|
-
```
|
|
260
|
-
|
|
261
|
-
### General Example
|
|
262
|
-
```python
|
|
263
|
-
from scrapling.fetchers import DynamicFetcher
|
|
264
|
-
|
|
265
|
-
def scrape_dynamic_content():
|
|
266
|
-
# Use Playwright for JavaScript content
|
|
267
|
-
page = DynamicFetcher.fetch(
|
|
268
|
-
'https://example.com/dynamic',
|
|
269
|
-
network_idle=True,
|
|
270
|
-
wait_selector='.content'
|
|
271
|
-
)
|
|
272
|
-
|
|
273
|
-
# Extract dynamic content
|
|
274
|
-
content = page.css('.content')
|
|
275
|
-
|
|
276
|
-
return {
|
|
277
|
-
'title': content.css('h1::text').get(),
|
|
278
|
-
'items': [
|
|
279
|
-
item.text for item in content.css('.item')
|
|
280
|
-
]
|
|
281
|
-
}
|
|
282
|
-
```
|
|
283
|
-
|
|
284
|
-
## Session Management
|
|
285
|
-
|
|
286
|
-
To keep the browser open while you make multiple requests with the same configuration, use the `DynamicSession`/`AsyncDynamicSession` classes. These classes accept all the arguments that the `fetch` function takes, which lets you specify a config for the entire session.
|
|
287
|
-
|
|
288
|
-
```python
|
|
289
|
-
from scrapling.fetchers import DynamicSession
|
|
290
|
-
|
|
291
|
-
# Create a session with default configuration
|
|
292
|
-
with DynamicSession(
|
|
293
|
-
headless=True,
|
|
294
|
-
disable_resources=True,
|
|
295
|
-
real_chrome=True
|
|
296
|
-
) as session:
|
|
297
|
-
# Make multiple requests with the same browser instance
|
|
298
|
-
page1 = session.fetch('https://example1.com')
|
|
299
|
-
page2 = session.fetch('https://example2.com')
|
|
300
|
-
page3 = session.fetch('https://dynamic-site.com')
|
|
301
|
-
|
|
302
|
-
# All requests reuse the same tab on the same browser instance
|
|
303
|
-
```
|
|
304
|
-
|
|
305
|
-
### Async Session Usage
|
|
306
|
-
|
|
307
|
-
```python
|
|
308
|
-
import asyncio
|
|
309
|
-
from scrapling.fetchers import AsyncDynamicSession
|
|
310
|
-
|
|
311
|
-
async def scrape_multiple_sites():
|
|
312
|
-
async with AsyncDynamicSession(
|
|
313
|
-
network_idle=True,
|
|
314
|
-
timeout=30000,
|
|
315
|
-
max_pages=3
|
|
316
|
-
) as session:
|
|
317
|
-
# Make async requests with shared browser configuration
|
|
318
|
-
pages = await asyncio.gather(
|
|
319
|
-
session.fetch('https://spa-app1.com'),
|
|
320
|
-
session.fetch('https://spa-app2.com'),
|
|
321
|
-
session.fetch('https://dynamic-content.com')
|
|
322
|
-
)
|
|
323
|
-
return pages
|
|
324
|
-
```
|
|
325
|
-
|
|
326
|
-
You may have noticed the `max_pages` argument. This is a new argument that lets the fetcher create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the number of pages that can be open at once. With each request, the library closes all tabs that have finished their task and checks whether the number of open tabs is below the maximum allowed number of pages/tabs, then:
|
|
327
|
-
|
|
328
|
-
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
|
|
329
|
-
2. Otherwise, it keeps checking at sub-second intervals, for up to 60 seconds, whether a new tab can be created, then raises `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
|
|
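The two steps above can be sketched with plain `asyncio`. This is a toy model of the described behavior, not Scrapling's implementation; `TabPool`, `fetch`, `demo`, and the timing values are assumptions chosen so the example finishes quickly.

```python
import asyncio
import time


class TabPool:
    """Toy model of a rotating tab pool limited to max_pages open tabs."""

    def __init__(self, max_pages: int, wait_timeout: float = 60.0, poll_interval: float = 0.05):
        self.max_pages = max_pages
        self.wait_timeout = wait_timeout
        self.poll_interval = poll_interval
        self.open_tabs = 0
        self.peak = 0  # highest number of tabs open at the same time

    async def acquire(self) -> None:
        """Wait for a free slot, polling until the timeout expires."""
        deadline = time.monotonic() + self.wait_timeout
        while self.open_tabs >= self.max_pages:
            if time.monotonic() >= deadline:
                raise TimeoutError("no free tab slot")
            await asyncio.sleep(self.poll_interval)
        self.open_tabs += 1
        self.peak = max(self.peak, self.open_tabs)

    def release(self) -> None:
        """Close a finished tab, freeing its slot."""
        self.open_tabs -= 1


async def fetch(pool: TabPool, url: str, work: float = 0.02) -> str:
    await pool.acquire()
    try:
        await asyncio.sleep(work)  # stands in for loading the page
        return url
    finally:
        pool.release()


async def demo():
    pool = TabPool(max_pages=2, wait_timeout=1.0, poll_interval=0.005)
    urls = [f"https://example.com/{i}" for i in range(5)]
    results = await asyncio.gather(*(fetch(pool, u) for u in urls))
    return results, pool.peak


results, peak = asyncio.run(demo())
print(len(results), "pages fetched; at most", peak, "tabs open at once")
```

All five requests complete, but no more than two tabs are ever open together; a request that cannot get a slot before the deadline raises `TimeoutError`, as described in step 2.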
330
|
-
|
|
331
|
-
This logic allows multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources and, most importantly, is fast :)
|
|
332
|
-
|
|
333
|
-
In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources and time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration of the previous request.
|
|
334
|
-
|
|
335
|
-
### Session Benefits
|
|
336
|
-
|
|
337
|
-
- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
|
|
338
|
-
- **Cookie persistence**: Cookies and session state are handled automatically, as in any browser.
|
|
339
|
-
- **Consistent fingerprint**: Same browser fingerprint across all requests.
|
|
340
|
-
- **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
|
|
341
|
-
|
|
342
|
-
## When to Use
|
|
343
|
-
|
|
344
|
-
Use DynamicFetcher when:
|
|
345
|
-
|
|
346
|
-
- Need browser automation
|
|
347
|
-
- Want multiple browser options
|
|
348
|
-
- Want to use a real Chrome browser
|
|
349
|
-
- Need custom browser config
|
|
350
|
-
- Want a few stealth options
|
|
351
|
-
|
|
352
|
-
If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
|