ultimate-pi 0.1.2 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.agents/skills/ck-search/SKILL.md +99 -0
- package/.agents/skills/defuddle/SKILL.md +90 -0
- package/.agents/skills/find-skills/SKILL.md +142 -0
- package/.agents/skills/firecrawl/SKILL.md +150 -0
- package/.agents/skills/firecrawl/rules/install.md +82 -0
- package/.agents/skills/firecrawl/rules/security.md +26 -0
- package/.agents/skills/firecrawl-agent/SKILL.md +57 -0
- package/.agents/skills/firecrawl-build-interact/SKILL.md +67 -0
- package/.agents/skills/firecrawl-build-onboarding/SKILL.md +102 -0
- package/.agents/skills/firecrawl-build-onboarding/references/auth-flow.md +39 -0
- package/.agents/skills/firecrawl-build-onboarding/references/project-setup.md +20 -0
- package/.agents/skills/firecrawl-build-onboarding/references/sdk-installation.md +17 -0
- package/.agents/skills/firecrawl-build-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-build-search/SKILL.md +68 -0
- package/.agents/skills/firecrawl-crawl/SKILL.md +58 -0
- package/.agents/skills/firecrawl-download/SKILL.md +69 -0
- package/.agents/skills/firecrawl-interact/SKILL.md +83 -0
- package/.agents/skills/firecrawl-map/SKILL.md +50 -0
- package/.agents/skills/firecrawl-parse/SKILL.md +61 -0
- package/.agents/skills/firecrawl-scrape/SKILL.md +68 -0
- package/.agents/skills/firecrawl-search/SKILL.md +59 -0
- package/.agents/skills/obsidian-bases/SKILL.md +299 -0
- package/.agents/skills/obsidian-markdown/SKILL.md +237 -0
- package/.agents/skills/posthog-analyst/SKILL.md +306 -0
- package/.agents/skills/posthog-analyst/evals/evals.json +23 -0
- package/.agents/skills/wiki/SKILL.md +215 -0
- package/.agents/skills/wiki/references/css-snippets.md +122 -0
- package/.agents/skills/wiki/references/frontmatter.md +107 -0
- package/.agents/skills/wiki/references/git-setup.md +58 -0
- package/.agents/skills/wiki/references/mcp-setup.md +149 -0
- package/.agents/skills/wiki/references/modes.md +259 -0
- package/.agents/skills/wiki/references/plugins.md +96 -0
- package/.agents/skills/wiki/references/rest-api.md +124 -0
- package/.agents/skills/wiki-autoresearch/SKILL.md +211 -0
- package/.agents/skills/wiki-autoresearch/references/program.md +75 -0
- package/.agents/skills/wiki-fold/SKILL.md +204 -0
- package/.agents/skills/wiki-fold/references/fold-template.md +133 -0
- package/.agents/skills/wiki-ingest/SKILL.md +288 -0
- package/.agents/skills/wiki-lint/SKILL.md +183 -0
- package/.agents/skills/wiki-query/SKILL.md +176 -0
- package/.agents/skills/wiki-save/SKILL.md +128 -0
- package/.ckignore +41 -0
- package/.env.example +9 -0
- package/.github/workflows/lint.yml +33 -0
- package/.github/workflows/publish-github-packages.yml +35 -0
- package/.github/workflows/publish-npm.yml +1 -1
- package/.pi/SYSTEM.md +107 -40
- package/.pi/agents/pi-pi/agent-expert.md +205 -0
- package/.pi/agents/pi-pi/cli-expert.md +47 -0
- package/.pi/agents/pi-pi/config-expert.md +67 -0
- package/.pi/agents/pi-pi/ext-expert.md +53 -0
- package/.pi/agents/pi-pi/keybinding-expert.md +123 -0
- package/.pi/agents/pi-pi/pi-orchestrator.md +103 -0
- package/.pi/agents/pi-pi/prompt-expert.md +83 -0
- package/.pi/agents/pi-pi/skill-expert.md +52 -0
- package/.pi/agents/pi-pi/theme-expert.md +46 -0
- package/.pi/agents/pi-pi/tui-expert.md +100 -0
- package/.pi/agents/rethink.md +140 -0
- package/.pi/agents/wiki-ingest.md +67 -0
- package/.pi/agents/wiki-lint.md +75 -0
- package/.pi/auto-commit.json +20 -0
- package/.pi/extensions/banner.png +0 -0
- package/.pi/extensions/ck-enforce.ts +216 -0
- package/.pi/extensions/custom-footer.ts +308 -0
- package/.pi/extensions/custom-header.ts +116 -0
- package/.pi/extensions/dotenv-loader.ts +170 -0
- package/.pi/internal/cursor-sdk-transcript-parser.ts +59 -0
- package/.pi/model-router.json +95 -0
- package/.pi/npm/.gitignore +2 -0
- package/.pi/prompts/git-sync.md +124 -0
- package/.pi/prompts/harness-setup.md +509 -0
- package/.pi/prompts/save.md +16 -0
- package/.pi/prompts/wiki-autoresearch.md +19 -0
- package/.pi/prompts/wiki.md +23 -0
- package/.pi/providers/cursor-sdk-provider.test.mjs +476 -0
- package/.pi/providers/cursor-sdk-provider.ts +1085 -0
- package/.pi/settings.json +14 -4
- package/.pi/skills/agent-router/SKILL.md +174 -0
- package/.pi/sounds/alert/1-kaching-track.mp3 +0 -0
- package/.pi/sounds/error/1-ksi-wth-track.mp3 +0 -0
- package/.pi/sounds/error/2-smash-track.mp3 +0 -0
- package/.pi/sounds/error/3-buzzer-track.mp3 +0 -0
- package/.pi/sounds/notification/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/project-sounds.json +25 -0
- package/.pi/sounds/reminder/1-soft-notification-track.mp3 +0 -0
- package/.pi/sounds/success/1-tada-track.mp3 +0 -0
- package/.pi/sounds/success/2-jobs-done-track.mp3 +0 -0
- package/.pi/sounds/success/3-yay-track.mp3 +0 -0
- package/CONTRIBUTING.md +116 -0
- package/README.md +32 -39
- package/biome.json +34 -0
- package/firecrawl/.env.template +58 -0
- package/firecrawl/README.md +49 -0
- package/firecrawl/docker-compose.yaml +201 -0
- package/firecrawl/searxng/searxng.env +3 -0
- package/firecrawl/searxng/settings.yml +85 -0
- package/lefthook.yml +8 -0
- package/package.json +55 -24
- package/vault/AGENTS.md +37 -0
- package/vault/wiki/_templates/comparison.md +39 -0
- package/vault/wiki/_templates/concept.md +40 -0
- package/vault/wiki/_templates/decision.md +21 -0
- package/vault/wiki/_templates/entity.md +32 -0
- package/vault/wiki/_templates/flow.md +14 -0
- package/vault/wiki/_templates/module.md +18 -0
- package/vault/wiki/_templates/question.md +31 -0
- package/vault/wiki/_templates/source.md +39 -0
- package/vault/wiki/concepts/AST-Aware Code Chunking.md +44 -0
- package/vault/wiki/concepts/Build-Time Prompt Compilation.md +107 -0
- package/vault/wiki/concepts/Context Engine (AI Coding).md +47 -0
- package/vault/wiki/concepts/Context-Aware System Reminders.md +61 -0
- package/vault/wiki/concepts/Contextualized Text Embedding.md +42 -0
- package/vault/wiki/concepts/Contractor vs Employee AI Model.md +55 -0
- package/vault/wiki/concepts/Dual-Model Agent Architecture.md +65 -0
- package/vault/wiki/concepts/Late Chunking vs Early Chunking.md +43 -0
- package/vault/wiki/concepts/Majority Vote Ensembling.md +68 -0
- package/vault/wiki/concepts/Meta-Harness.md +16 -0
- package/vault/wiki/concepts/Multi-Agent AI Coding Architecture.md +75 -0
- package/vault/wiki/concepts/Prompt Enhancement.md +90 -0
- package/vault/wiki/concepts/Prompt Renderer.md +89 -0
- package/vault/wiki/concepts/Semantic Codebase Indexing.md +67 -0
- package/vault/wiki/concepts/additive-config-hierarchy.md +16 -0
- package/vault/wiki/concepts/agent-artifacts-verifiable-deliverables.md +71 -0
- package/vault/wiki/concepts/agent-browser-browser-automation.md +99 -0
- package/vault/wiki/concepts/agent-codebase-interface.md +43 -0
- package/vault/wiki/concepts/agent-harness-architecture.md +67 -0
- package/vault/wiki/concepts/agent-loop-detection-patterns.md +133 -0
- package/vault/wiki/concepts/agent-search-enforcement.md +126 -0
- package/vault/wiki/concepts/agent-skills-ecosystem.md +74 -0
- package/vault/wiki/concepts/agent-skills-pattern.md +68 -0
- package/vault/wiki/concepts/agentic-harness-context-enforcement.md +91 -0
- package/vault/wiki/concepts/agentic-harness.md +34 -0
- package/vault/wiki/concepts/agentic-orchestration-pipeline.md +56 -0
- package/vault/wiki/concepts/agentic-search-no-embeddings.md +18 -0
- package/vault/wiki/concepts/anthropic-context-engineering.md +13 -0
- package/vault/wiki/concepts/antigravity-agent-first-architecture.md +61 -0
- package/vault/wiki/concepts/ast-compression.md +19 -0
- package/vault/wiki/concepts/ast-truncation.md +66 -0
- package/vault/wiki/concepts/barrel-files.md +37 -0
- package/vault/wiki/concepts/browser-harness-agent.md +41 -0
- package/vault/wiki/concepts/browser-subagent-visual-verification.md +82 -0
- package/vault/wiki/concepts/codebase-intelligence-ecosystem-comparison.md +192 -0
- package/vault/wiki/concepts/codebase-intelligence-harness-integration.md +161 -0
- package/vault/wiki/concepts/codebase-to-context-ingestion.md +46 -0
- package/vault/wiki/concepts/codex-harness-innovations.md +147 -0
- package/vault/wiki/concepts/consensus-debate-flow.md +17 -0
- package/vault/wiki/concepts/consensus-debate.md +206 -0
- package/vault/wiki/concepts/content-addressed-spec-identity.md +166 -0
- package/vault/wiki/concepts/context-anxiety.md +57 -0
- package/vault/wiki/concepts/context-compression-techniques.md +19 -0
- package/vault/wiki/concepts/context-continuity.md +22 -0
- package/vault/wiki/concepts/context-drift-in-agents.md +106 -0
- package/vault/wiki/concepts/context-engineering.md +62 -0
- package/vault/wiki/concepts/context-folding.md +67 -0
- package/vault/wiki/concepts/context-mode.md +38 -0
- package/vault/wiki/concepts/cursor-harness-innovations.md +107 -0
- package/vault/wiki/concepts/deterministic-session-compaction.md +79 -0
- package/vault/wiki/concepts/drift-detection-unified.md +296 -0
- package/vault/wiki/concepts/execution-feedback-loop.md +46 -0
- package/vault/wiki/concepts/feedforward-feedback-harness.md +60 -0
- package/vault/wiki/concepts/five-root-cause-metrics-sentrux.md +40 -0
- package/vault/wiki/concepts/fork-safe-spec-storage.md +89 -0
- package/vault/wiki/concepts/fts5-sandbox.md +19 -0
- package/vault/wiki/concepts/fuzzy-edit-matching.md +71 -0
- package/vault/wiki/concepts/gemini-cli-architecture.md +104 -0
- package/vault/wiki/concepts/generator-evaluator-architecture.md +64 -0
- package/vault/wiki/concepts/guardian-agent-pattern.md +67 -0
- package/vault/wiki/concepts/harness-configuration-layers.md +89 -0
- package/vault/wiki/concepts/harness-control-frameworks.md +155 -0
- package/vault/wiki/concepts/harness-engineering-first-principles.md +90 -0
- package/vault/wiki/concepts/harness-h-formalism.md +53 -0
- package/vault/wiki/concepts/hybrid-code-search.md +61 -0
- package/vault/wiki/concepts/inline-post-edit-validation.md +112 -0
- package/vault/wiki/concepts/legendary-engineering-patterns-harness.md +110 -0
- package/vault/wiki/concepts/lifecycle-hooks.md +94 -0
- package/vault/wiki/concepts/mcp-tool-routing.md +102 -0
- package/vault/wiki/concepts/memory-system-of-record-vs-ephemeral-cache.md +47 -0
- package/vault/wiki/concepts/meta-agent-context-pruning.md +151 -0
- package/vault/wiki/concepts/model-adaptive-harness.md +122 -0
- package/vault/wiki/concepts/model-routing-agents.md +101 -0
- package/vault/wiki/concepts/monorepo-architecture.md +45 -0
- package/vault/wiki/concepts/multi-agent-specialization.md +61 -0
- package/vault/wiki/concepts/permission-subsystem.md +16 -0
- package/vault/wiki/concepts/pi-messenger-analysis.md +243 -0
- package/vault/wiki/concepts/pi-vscode-extension-landscape.md +37 -0
- package/vault/wiki/concepts/policy-engine-pattern.md +78 -0
- package/vault/wiki/concepts/progressive-disclosure-agents.md +53 -0
- package/vault/wiki/concepts/progressive-skill-disclosure.md +17 -0
- package/vault/wiki/concepts/provider-native-prompting.md +203 -0
- package/vault/wiki/concepts/quality-signal-sentrux.md +37 -0
- package/vault/wiki/concepts/repo-map-ranking.md +42 -0
- package/vault/wiki/concepts/result-monad-error-handling.md +47 -0
- package/vault/wiki/concepts/safety-defense-in-depth.md +83 -0
- package/vault/wiki/concepts/sandbox-os-enforcement.md +18 -0
- package/vault/wiki/concepts/selective-debate-routing.md +70 -0
- package/vault/wiki/concepts/self-evolving-harness.md +60 -0
- package/vault/wiki/concepts/sentrux-mcp-integration.md +36 -0
- package/vault/wiki/concepts/sentrux-rules-engine.md +49 -0
- package/vault/wiki/concepts/shell-pattern-compression.md +24 -0
- package/vault/wiki/concepts/skill-first-architecture.md +166 -0
- package/vault/wiki/concepts/structured-compaction.md +78 -0
- package/vault/wiki/concepts/subagent-orchestration.md +17 -0
- package/vault/wiki/concepts/subagent-worktree-isolation.md +68 -0
- package/vault/wiki/concepts/superpowers-methodology.md +78 -0
- package/vault/wiki/concepts/think-in-code.md +73 -0
- package/vault/wiki/concepts/ts-execution-layer.md +100 -0
- package/vault/wiki/concepts/typescript-strict-mode.md +37 -0
- package/vault/wiki/concepts/vcc-conversation-compaction-for-pi.md +51 -0
- package/vault/wiki/concepts/verification-drift-detection.md +19 -0
- package/vault/wiki/consensus/consensus-records.md +58 -0
- package/vault/wiki/decisions/2026-04-30-pi-lean-ctx-native.md +122 -0
- package/vault/wiki/decisions/adr-008.md +40 -0
- package/vault/wiki/decisions/adr-009.md +46 -0
- package/vault/wiki/decisions/adr-010.md +55 -0
- package/vault/wiki/decisions/adr-011.md +165 -0
- package/vault/wiki/decisions/adr-012.md +102 -0
- package/vault/wiki/decisions/adr-013.md +59 -0
- package/vault/wiki/decisions/adr-014.md +73 -0
- package/vault/wiki/decisions/adr-015.md +81 -0
- package/vault/wiki/decisions/adr-016.md +91 -0
- package/vault/wiki/decisions/adr-017.md +79 -0
- package/vault/wiki/decisions/adr-018.md +100 -0
- package/vault/wiki/decisions/adr-019.md +75 -0
- package/vault/wiki/decisions/adr-020.md +106 -0
- package/vault/wiki/decisions/adr-021.md +86 -0
- package/vault/wiki/decisions/adr-022.md +113 -0
- package/vault/wiki/decisions/adr-023.md +113 -0
- package/vault/wiki/decisions/adr-024.md +73 -0
- package/vault/wiki/decisions/adr-025.md +130 -0
- package/vault/wiki/decisions/adr-026.md +56 -0
- package/vault/wiki/decisions/colocate-wiki.md +34 -0
- package/vault/wiki/entities/Anders Hejlsberg.md +29 -0
- package/vault/wiki/entities/Anthropic.md +17 -0
- package/vault/wiki/entities/Augment Code.md +49 -0
- package/vault/wiki/entities/Bjarne Stroustrup.md +26 -0
- package/vault/wiki/entities/Bolt.new (StackBlitz).md +39 -0
- package/vault/wiki/entities/Boris Cherny.md +11 -0
- package/vault/wiki/entities/Claude Code.md +19 -0
- package/vault/wiki/entities/Dennis Ritchie.md +26 -0
- package/vault/wiki/entities/Emergent Labs.md +32 -0
- package/vault/wiki/entities/Google Cloud.md +16 -0
- package/vault/wiki/entities/Guido van Rossum.md +28 -0
- package/vault/wiki/entities/Ken Thompson.md +28 -0
- package/vault/wiki/entities/Lee et al.md +16 -0
- package/vault/wiki/entities/Linus Torvalds.md +28 -0
- package/vault/wiki/entities/Lovable (company).md +40 -0
- package/vault/wiki/entities/Martin Fowler.md +16 -0
- package/vault/wiki/entities/Meng et al.md +16 -0
- package/vault/wiki/entities/OpenAI.md +16 -0
- package/vault/wiki/entities/Rocket.new.md +38 -0
- package/vault/wiki/entities/VILA-Lab.md +15 -0
- package/vault/wiki/entities/autodev-codebase.md +18 -0
- package/vault/wiki/entities/ck-tool.md +59 -0
- package/vault/wiki/entities/codesearch.md +18 -0
- package/vault/wiki/entities/disler-indydevdan.md +33 -0
- package/vault/wiki/entities/gsd-get-shit-done.md +56 -0
- package/vault/wiki/entities/javascript-runtimes.md +48 -0
- package/vault/wiki/entities/jesse-vincent.md +38 -0
- package/vault/wiki/entities/lean-ctx.md +32 -0
- package/vault/wiki/entities/opendev.md +41 -0
- package/vault/wiki/entities/ops-codegraph-tool.md +18 -0
- package/vault/wiki/entities/pi-coding-agent.md +53 -0
- package/vault/wiki/entities/sentrux.md +54 -0
- package/vault/wiki/entities/vgrep-tool.md +57 -0
- package/vault/wiki/entities/vitest.md +41 -0
- package/vault/wiki/flows/harness-wiki-pipeline.md +204 -0
- package/vault/wiki/hot.md +932 -0
- package/vault/wiki/index.md +437 -0
- package/vault/wiki/log.md +418 -0
- package/vault/wiki/meta/dashboard.md +30 -0
- package/vault/wiki/meta/lint-report-2026-04-30.md +86 -0
- package/vault/wiki/meta/lint-report-2026-05-02.md +251 -0
- package/vault/wiki/meta/overview.canvas +43 -0
- package/vault/wiki/modules/adversarial-verification.md +57 -0
- package/vault/wiki/modules/automated-observability.md +54 -0
- package/vault/wiki/modules/bench.md +20 -0
- package/vault/wiki/modules/extensions.md +23 -0
- package/vault/wiki/modules/grounding-checkpoints.md +62 -0
- package/vault/wiki/modules/harness-implementation-plan.md +345 -0
- package/vault/wiki/modules/harness-wiki-skill-mapping.md +135 -0
- package/vault/wiki/modules/harness.md +86 -0
- package/vault/wiki/modules/persistent-memory.md +85 -0
- package/vault/wiki/modules/schema-orchestration.md +68 -0
- package/vault/wiki/modules/skills.md +27 -0
- package/vault/wiki/modules/spec-hardening.md +58 -0
- package/vault/wiki/modules/structured-planning.md +53 -0
- package/vault/wiki/modules/think-in-code-enforcement.md +153 -0
- package/vault/wiki/modules/wiki-query-interface.md +64 -0
- package/vault/wiki/overview.md +51 -0
- package/vault/wiki/questions/Research-pi-vs-claude-code-agentic-orchestration-pipeline.md +87 -0
- package/vault/wiki/questions/Research-sentrux-dev.md +123 -0
- package/vault/wiki/questions/Research-superpowers-skill-for-agentic-coding-agents.md +164 -0
- package/vault/wiki/questions/Research: Augment Code Context Engine.md +244 -0
- package/vault/wiki/questions/Research: Automating Software Engineering - Lovable, Bolt, Emergent, Rocket.md +112 -0
- package/vault/wiki/questions/Research: Claude Code State-of-the-Art Harness Improvements.md +209 -0
- package/vault/wiki/questions/Research: Codex State-of-the-Art Harness Improvements.md +99 -0
- package/vault/wiki/questions/Research: Engineering Workflows of Legendary Programmers and AI Harness Mapping.md +107 -0
- package/vault/wiki/questions/Research: Fallow Codebase Intelligence Harness Integration.md +72 -0
- package/vault/wiki/questions/Research: Gemini CLI SOTA Harness Integration.md +166 -0
- package/vault/wiki/questions/Research: GitHub Issues as Harness Spec Storage.md +188 -0
- package/vault/wiki/questions/Research: Google Antigravity Harness Integration.md +120 -0
- package/vault/wiki/questions/Research: Meta-Agent Context Drift Detection.md +236 -0
- package/vault/wiki/questions/Research: Model-Adaptive Agent Harness Design.md +95 -0
- package/vault/wiki/questions/Research: Model-Specific Prompting Guides.md +165 -0
- package/vault/wiki/questions/Research: Prompt Renderer for Multi-Model Agent Harness.md +216 -0
- package/vault/wiki/questions/Research: Skill-First Harness Architecture.md +91 -0
- package/vault/wiki/questions/Research: TypeScript Best Practices and Codebase Structure.md +88 -0
- package/vault/wiki/questions/Research: TypeScript Execution Layer for Agent Tool Calling.md +81 -0
- package/vault/wiki/questions/Research: claude-mem over Obsidian for Harness Layer.md +71 -0
- package/vault/wiki/questions/Research: claude-mem over obsidian wiki as the knowledge base for our agentic harness pipeline. think from first principles. does this replace or complement our current setup? no hard feelings about previous decisions. gimme accurate points.md +80 -0
- package/vault/wiki/questions/Research: context-mode vs lean-ctx.md +72 -0
- package/vault/wiki/questions/Research: cursor.sh Harness Innovations.md +92 -0
- package/vault/wiki/questions/Research: executor.sh Harness Integration.md +170 -0
- package/vault/wiki/questions/Research: how GSD fits into our coding harness setup.md +97 -0
- package/vault/wiki/questions/Research: how claude-mem fits into our workflow. and whether it should replace obsidian in the codebase. no hard feelings about previous actions, rethink from first principles always.md +80 -0
- package/vault/wiki/questions/Research: pi-vcc.md +113 -0
- package/vault/wiki/questions/Research: semantic code search tools.md +69 -0
- package/vault/wiki/questions/Research: vcc extension for pi coding agent.md +73 -0
- package/vault/wiki/questions/how-to-enable-semantic-code-search-now.md +111 -0
- package/vault/wiki/questions/mvp-implementation-blueprint.md +552 -0
- package/vault/wiki/questions/research-agent-first-codebase-exploration.md +199 -0
- package/vault/wiki/questions/research-agentic-coding-harness-latest-papers.md +142 -0
- package/vault/wiki/questions/research-gitingest-gitreverse-integration.md +100 -0
- package/vault/wiki/questions/research-wozcode-token-reduction.md +67 -0
- package/vault/wiki/questions/resolved-context-pruning-inplace-vs-restart.md +95 -0
- package/vault/wiki/questions/resolved-context-window-economics.md +167 -0
- package/vault/wiki/questions/resolved-imad-debate-gating-transfer.md +126 -0
- package/vault/wiki/questions/resolved-mcp-tool-preference.md +112 -0
- package/vault/wiki/questions/resolved-small-model-meta-agents.md +107 -0
- package/vault/wiki/questions/resolved-treesitter-dynamic-languages.md +95 -0
- package/vault/wiki/sources/Auggie Context MCP Server.md +63 -0
- package/vault/wiki/sources/Augment Code Codacy AI Giants.md +61 -0
- package/vault/wiki/sources/Augment Code MCP SiliconAngle.md +49 -0
- package/vault/wiki/sources/Augment Code WorkOS ERC 2025.md +55 -0
- package/vault/wiki/sources/Augment Context Engine Official.md +71 -0
- package/vault/wiki/sources/Augment SWE-bench Agent GitHub.md +74 -0
- package/vault/wiki/sources/Augment SWE-bench Pro Blog.md +58 -0
- package/vault/wiki/sources/Source: AgentBus Jinja2 Prompt Pipelines.md +75 -0
- package/vault/wiki/sources/Source: Arxiv — Don't Break the Cache.md +85 -0
- package/vault/wiki/sources/Source: Augment - Harness Engineering for AI Coding Agents.md +58 -0
- package/vault/wiki/sources/Source: Blake Crosley Agent Architecture Guide.md +100 -0
- package/vault/wiki/sources/Source: Bolt.new Architecture & Case Study.md +75 -0
- package/vault/wiki/sources/Source: Build-Time Prompt Compilation Architecture.md +107 -0
- package/vault/wiki/sources/Source: Claude API Agent Skills Overview.md +70 -0
- package/vault/wiki/sources/Source: Gemini CLI Changelogs.md +88 -0
- package/vault/wiki/sources/Source: Google Blog - Gemini CLI Announcement.md +57 -0
- package/vault/wiki/sources/Source: Google Gemini CLI Architecture Docs.md +53 -0
- package/vault/wiki/sources/Source: LangChain - Anatomy of Agent Harness.md +65 -0
- package/vault/wiki/sources/Source: Lovable Architecture & Clone Analysis.md +83 -0
- package/vault/wiki/sources/Source: Martin Fowler - Harness Engineering.md +70 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering Five Principles.md +58 -0
- package/vault/wiki/sources/Source: OpenAI Harness Engineering — 0 Lines of Human Code.md +101 -0
- package/vault/wiki/sources/Source: OpenDev — Building AI Coding Agents for the Terminal.md +100 -0
- package/vault/wiki/sources/Source: Render AI Coding Agents Benchmark 2025.md +53 -0
- package/vault/wiki/sources/Source: Rocket.new — Vibe Solutioning Platform.md +70 -0
- package/vault/wiki/sources/Source: SwirlAI Agent Skills Progressive Disclosure.md +71 -0
- package/vault/wiki/sources/Source: TianPan Prompt Caching Architecture.md +89 -0
- package/vault/wiki/sources/Source: Vercel Labs agent-browser.md +155 -0
- package/vault/wiki/sources/Source: browser-harness CDP Harness.md +126 -0
- package/vault/wiki/sources/agent-drift-academic-paper.md +79 -0
- package/vault/wiki/sources/aider-repomap-tree-sitter.md +42 -0
- package/vault/wiki/sources/anthropic-compaction-api.md +58 -0
- package/vault/wiki/sources/anthropic-effective-harnesses.md +42 -0
- package/vault/wiki/sources/anthropic-prompt-best-practices.md +100 -0
- package/vault/wiki/sources/anthropic2026-harness-design.md +63 -0
- package/vault/wiki/sources/barrel-files-tkdodo.md +38 -0
- package/vault/wiki/sources/birth-of-unix-kernighan-interview.md +57 -0
- package/vault/wiki/sources/bockeler2026-harness-engineering.md +69 -0
- package/vault/wiki/sources/cast-code-chunking-paper.md +50 -0
- package/vault/wiki/sources/ck-semantic-search.md +78 -0
- package/vault/wiki/sources/claude-code-architecture-karaxai-2026.md +71 -0
- package/vault/wiki/sources/claude-code-architecture-qubytes-2026.md +50 -0
- package/vault/wiki/sources/claude-code-architecture-vila-lab-2026.md +64 -0
- package/vault/wiki/sources/claude-code-security-architecture-penligent-2026.md +70 -0
- package/vault/wiki/sources/claude-context-editing-docs.md +13 -0
- package/vault/wiki/sources/cloudflare-codemode.md +63 -0
- package/vault/wiki/sources/code-chunk-library-supermemory.md +63 -0
- package/vault/wiki/sources/codeact-apple-2024.md +62 -0
- package/vault/wiki/sources/codex-dsc-rfc-8573.md +41 -0
- package/vault/wiki/sources/codex-open-source-agent-2026.md +110 -0
- package/vault/wiki/sources/coir-code-retrieval-benchmark.md +51 -0
- package/vault/wiki/sources/colinmcnamara-context-optimization-codemode.md +48 -0
- package/vault/wiki/sources/context-folding-paper.md +61 -0
- package/vault/wiki/sources/context-mode-website.md +63 -0
- package/vault/wiki/sources/cursor-agent-best-practices-2026.md +62 -0
- package/vault/wiki/sources/cursor-fork-29b-2025.md +50 -0
- package/vault/wiki/sources/cursor-harness-april-2026.md +76 -0
- package/vault/wiki/sources/cursor-instant-apply-2024.md +45 -0
- package/vault/wiki/sources/cursor-shadow-workspace-2024.md +52 -0
- package/vault/wiki/sources/cursor-shipped-coding-agent-2026.md +53 -0
- package/vault/wiki/sources/cursor-vs-antigravity-2026.md +51 -0
- package/vault/wiki/sources/disler-pi-vs-claude-code.md +69 -0
- package/vault/wiki/sources/distill-deterministic-context-compression.md +53 -0
- package/vault/wiki/sources/embedding-models-benchmark-supermemory-2025.md +48 -0
- package/vault/wiki/sources/executor-rhyssullivan.md +122 -0
- package/vault/wiki/sources/fallow-rs-codebase-intelligence.md +125 -0
- package/vault/wiki/sources/fan2025-imad.md +60 -0
- package/vault/wiki/sources/forgecode-gpt5-agent-improvements.md +63 -0
- package/vault/wiki/sources/gemini-3-prompting-guide.md +78 -0
- package/vault/wiki/sources/gh-cli-sub-issue-rfc.md +50 -0
- package/vault/wiki/sources/gh-sub-issue-extension.md +72 -0
- package/vault/wiki/sources/github-fork-issues-discussion.md +44 -0
- package/vault/wiki/sources/github-issue-dependencies-docs.md +49 -0
- package/vault/wiki/sources/github-sub-issues-docs.md +51 -0
- package/vault/wiki/sources/gitingest.md +91 -0
- package/vault/wiki/sources/gitreverse.md +63 -0
- package/vault/wiki/sources/google-antigravity-official-blog.md +47 -0
- package/vault/wiki/sources/google-antigravity-wikipedia.md +53 -0
- package/vault/wiki/sources/gsd-codecentric-deep-dive.md +57 -0
- package/vault/wiki/sources/gsd-github-repo.md +51 -0
- package/vault/wiki/sources/gsd-hn-discussion.md +59 -0
- package/vault/wiki/sources/guido-python-design-philosophy.md +56 -0
- package/vault/wiki/sources/hejlsberg-7-learnings.md +48 -0
- package/vault/wiki/sources/ironclaw-drift-monitor.md +80 -0
- package/vault/wiki/sources/langsight-loop-detection.md +80 -0
- package/vault/wiki/sources/leanctx-website.md +69 -0
- package/vault/wiki/sources/lee2026-meta-harness.md +59 -0
- package/vault/wiki/sources/linux-kernel-coding-workflow.md +50 -0
- package/vault/wiki/sources/lou2026-autoharness.md +53 -0
- package/vault/wiki/sources/martin-fowler-harness-engineering.md +73 -0
- package/vault/wiki/sources/mcp-architecture-docs.md +13 -0
- package/vault/wiki/sources/meng2026-agent-harness-survey.md +79 -0
- package/vault/wiki/sources/mindstudio-four-agent-types.md +68 -0
- package/vault/wiki/sources/ms-chat-history-management.md +13 -0
- package/vault/wiki/sources/openai-prompt-guidance.md +104 -0
- package/vault/wiki/sources/openclaw-session-pruning.md +13 -0
- package/vault/wiki/sources/opencode-dcp.md +13 -0
- package/vault/wiki/sources/opendev-arxiv-2603.05344v1.md +79 -0
- package/vault/wiki/sources/openhands-platform.md +39 -0
- package/vault/wiki/sources/oss-guide-codebase-exploration.md +53 -0
- package/vault/wiki/sources/pi-compaction-extensions-ecosystem.md +102 -0
- package/vault/wiki/sources/pi-context-prune-github-repo.md +38 -0
- package/vault/wiki/sources/pi-mono-compaction-docs.md +38 -0
- package/vault/wiki/sources/pi-omni-compact-github-repo.md +50 -0
- package/vault/wiki/sources/pi-rtk-optimizer-github-repo.md +45 -0
- package/vault/wiki/sources/pi-vcc-github-repo.md +69 -0
- package/vault/wiki/sources/pi-vscode-marketplace.md +41 -0
- package/vault/wiki/sources/pi-vscode-model-provider-marketplace.md +39 -0
- package/vault/wiki/sources/py-tree-sitter.md +13 -0
- package/vault/wiki/sources/sentrux-dev-landing.md +40 -0
- package/vault/wiki/sources/sentrux-docs-pro-architecture.md +75 -0
- package/vault/wiki/sources/sentrux-docs-quality-signal.md +46 -0
- package/vault/wiki/sources/sentrux-docs-root-cause-metrics.md +57 -0
- package/vault/wiki/sources/sentrux-docs-rules-engine.md +58 -0
- package/vault/wiki/sources/sentrux-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-github-repo.md +56 -0
- package/vault/wiki/sources/superpowers-release-blog.md +54 -0
- package/vault/wiki/sources/superpowers-termdock-analysis.md +45 -0
- package/vault/wiki/sources/swe-agent-aci.md +42 -0
- package/vault/wiki/sources/swe-bench.md +45 -0
- package/vault/wiki/sources/swe-pruner-context-pruning.md +13 -0
- package/vault/wiki/sources/think-in-code-blog.md +48 -0
- package/vault/wiki/sources/tree-sitter-docs.md +13 -0
- package/vault/wiki/sources/ts-best-practices-2025-devto.md +42 -0
- package/vault/wiki/sources/ts-folder-structure-mingyang.md +58 -0
- package/vault/wiki/sources/ts-monorepo-koerselman.md +44 -0
- package/vault/wiki/sources/ts-result-error-handling-kkalamarski.md +52 -0
- package/vault/wiki/sources/ts-runtimes-comparison-betterstack.md +42 -0
- package/vault/wiki/sources/ts-strict-mode-rishikc.md +43 -0
- package/vault/wiki/sources/unix-philosophy.md +48 -0
- package/vault/wiki/sources/vectara-chunking-vs-embedding-naacl2025.md +39 -0
- package/vault/wiki/sources/vectara-guardian-agents.md +79 -0
- package/vault/wiki/sources/vgrep-semantic-search.md +76 -0
- package/vault/wiki/sources/vitest-official.md +41 -0
- package/vault/wiki/sources/vscode-pi-community-extension.md +40 -0
- package/vault/wiki/sources/wozcode.md +79 -0
- package/.agents/skills/compress/SKILL.md +0 -111
- package/.agents/skills/compress/scripts/__init__.py +0 -9
- package/.agents/skills/compress/scripts/__main__.py +0 -3
- package/.agents/skills/compress/scripts/benchmark.py +0 -78
- package/.agents/skills/compress/scripts/cli.py +0 -73
- package/.agents/skills/compress/scripts/compress.py +0 -227
- package/.agents/skills/compress/scripts/detect.py +0 -121
- package/.agents/skills/compress/scripts/validate.py +0 -189
- package/.agents/skills/emil-design-eng/SKILL.md +0 -679
- package/.agents/skills/lean-ctx/SKILL.md +0 -149
- package/.agents/skills/lean-ctx/scripts/install.sh +0 -95
- package/.agents/skills/scrapling-official/LICENSE.txt +0 -28
- package/.agents/skills/scrapling-official/SKILL.md +0 -390
- package/.agents/skills/scrapling-official/examples/01_fetcher_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/02_dynamic_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/03_stealthy_session.py +0 -26
- package/.agents/skills/scrapling-official/examples/04_spider.py +0 -58
- package/.agents/skills/scrapling-official/examples/README.md +0 -45
- package/.agents/skills/scrapling-official/references/fetching/choosing.md +0 -78
- package/.agents/skills/scrapling-official/references/fetching/dynamic.md +0 -352
- package/.agents/skills/scrapling-official/references/fetching/static.md +0 -432
- package/.agents/skills/scrapling-official/references/fetching/stealthy.md +0 -255
- package/.agents/skills/scrapling-official/references/mcp-server.md +0 -214
- package/.agents/skills/scrapling-official/references/migrating_from_beautifulsoup.md +0 -86
- package/.agents/skills/scrapling-official/references/parsing/adaptive.md +0 -212
- package/.agents/skills/scrapling-official/references/parsing/main_classes.md +0 -586
- package/.agents/skills/scrapling-official/references/parsing/selection.md +0 -494
- package/.agents/skills/scrapling-official/references/spiders/advanced.md +0 -344
- package/.agents/skills/scrapling-official/references/spiders/architecture.md +0 -94
- package/.agents/skills/scrapling-official/references/spiders/getting-started.md +0 -164
- package/.agents/skills/scrapling-official/references/spiders/proxy-blocking.md +0 -235
- package/.agents/skills/scrapling-official/references/spiders/requests-responses.md +0 -196
- package/.agents/skills/scrapling-official/references/spiders/sessions.md +0 -205
- package/PLAN.md +0 -11
- package/extensions/lean-ctx-enforce.ts +0 -166
- package/skills-lock.json +0 -35
- package/wiki/README.md +0 -19
- package/wiki/decisions/0001-establish-project-wiki-and-decision-record-format.md +0 -25
- package/wiki/decisions/0002-add-project-banner-to-readme.md +0 -26
- package/wiki/decisions/0003-remove-redundant-readme-title-heading.md +0 -26
- package/wiki/decisions/0004-publish-package-to-npm-as-ultimate-pi.md +0 -26
- package/wiki/decisions/0005-automate-npm-publish-with-github-actions.md +0 -27
- package/wiki/decisions/0006-switch-to-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0007-use-absolute-banner-url-for-npm-readme-rendering.md +0 -26
- package/wiki/decisions/0008-rename-banner-asset-for-cache-busting.md +0 -26
- package/wiki/decisions/0009-force-oidc-path-by-clearing-node-auth-token-in-publish-step.md +0 -25
- package/wiki/decisions/0010-simplify-setup-node-for-npm-trusted-publishing.md +0 -26
- package/wiki/decisions/0011-add-noop-workflow-change-to-force-fresh-publish-run.md +0 -25
- package/wiki/decisions/0012-align-workflow-runtime-with-npm-trusted-publishing-requirements.md +0 -26
- package/wiki/decisions/0013-add-package-repository-url-for-provenance-validation.md +0 -25
|
@@ -1,586 +0,0 @@
|
|
|
1
|
-
# Parsing main classes
|
|
2
|
-
|
|
3
|
-
The [Selector](#selector) class is the core parsing engine in Scrapling, providing HTML parsing and element selection capabilities. You can import it with either of the following statements:
|
|
4
|
-
```python
|
|
5
|
-
from scrapling import Selector
|
|
6
|
-
from scrapling.parser import Selector
|
|
7
|
-
```
|
|
8
|
-
Usage:
|
|
9
|
-
```python
|
|
10
|
-
page = Selector(
|
|
11
|
-
'<html>...</html>',
|
|
12
|
-
url='https://example.com'
|
|
13
|
-
)
|
|
14
|
-
|
|
15
|
-
# Then select elements as you like
|
|
16
|
-
elements = page.css('.product')
|
|
17
|
-
```
|
|
18
|
-
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, a [Selector](#selector) object. Any operation you do, like selection, navigation, etc., will return either a [Selector](#selector) object or a [Selectors](#selectors) object, given that the result is element/elements from the page, not text or similar.
|
|
19
|
-
|
|
20
|
-
The main page is a [Selector](#selector) object, and the elements within are [Selector](#selector) objects. Any text (text content inside elements or attribute values) is a [TextHandler](#texthandler) object, and element attributes are stored as [AttributesHandler](#attributeshandler).
|
|
21
|
-
|
|
22
|
-
## Selector
|
|
23
|
-
### Arguments explained
|
|
24
|
-
The most important argument is `content`, which takes the HTML you want to parse as `str` or `bytes`.
|
|
25
|
-
|
|
26
|
-
The arguments `url`, `adaptive`, `storage`, and `storage_args` are settings used with the `adaptive` feature. They are explained in the [adaptive](adaptive.md) feature page.
|
|
27
|
-
|
|
28
|
-
Arguments for parsing adjustments:
|
|
29
|
-
|
|
30
|
-
- **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
|
|
31
|
-
- **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default because it can cause issues with your scraping in various ways.
|
|
32
|
-
- **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.
|
|
33
|
-
|
|
34
|
-
The arguments `huge_tree` and `root` are advanced features not covered here.
|
|
35
|
-
|
|
36
|
-
Most properties on the main page and its elements are lazily loaded (not initialized until accessed), which contributes to Scrapling's speed.
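To illustrate what lazy loading means in general, here is a minimal sketch built on the standard library's `functools.cached_property`. The `LazyNode` class and its `text` property are hypothetical names chosen for this example; this is an assumption about the general technique, not Scrapling's actual implementation.

```python
from functools import cached_property


class LazyNode:
    """Hypothetical element wrapper used only for illustration."""

    def __init__(self, raw: str):
        self.raw = raw
        self.computations = 0  # counts how often the expensive work runs

    @cached_property
    def text(self) -> str:
        # Runs only on first access; the result is then cached on the instance.
        self.computations += 1
        return self.raw.strip()


node = LazyNode("  Product 1  ")
before_access = node.computations  # nothing has been computed yet
first = node.text                  # first access triggers the computation
second = node.text                 # second access reuses the cached value
```

Creating the object costs almost nothing, and the work is paid for only by properties that are actually read.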
|
|
37
|
-
|
|
38
|
-
### Properties
|
|
39
|
-
Properties for traversal are separated in the [traversal](#traversal) section below.
|
|
40
|
-
|
|
41
|
-
Parsing this HTML page as an example:
|
|
42
|
-
```html
|
|
43
|
-
<html>
|
|
44
|
-
<head>
|
|
45
|
-
<title>Some page</title>
|
|
46
|
-
</head>
|
|
47
|
-
<body>
|
|
48
|
-
<div class="product-list">
|
|
49
|
-
<article class="product" data-id="1">
|
|
50
|
-
<h3>Product 1</h3>
|
|
51
|
-
<p class="description">This is product 1</p>
|
|
52
|
-
<span class="price">$10.99</span>
|
|
53
|
-
<div class="hidden stock">In stock: 5</div>
|
|
54
|
-
</article>
|
|
55
|
-
|
|
56
|
-
<article class="product" data-id="2">
|
|
57
|
-
<h3>Product 2</h3>
|
|
58
|
-
<p class="description">This is product 2</p>
|
|
59
|
-
<span class="price">$20.99</span>
|
|
60
|
-
<div class="hidden stock">In stock: 3</div>
|
|
61
|
-
</article>
|
|
62
|
-
|
|
63
|
-
<article class="product" data-id="3">
|
|
64
|
-
<h3>Product 3</h3>
|
|
65
|
-
<p class="description">This is product 3</p>
|
|
66
|
-
<span class="price">$15.99</span>
|
|
67
|
-
<div class="hidden stock">Out of stock</div>
|
|
68
|
-
</article>
|
|
69
|
-
</div>
|
|
70
|
-
|
|
71
|
-
<script id="page-data" type="application/json">
|
|
72
|
-
{
|
|
73
|
-
"lastUpdated": "2024-09-22T10:30:00Z",
|
|
74
|
-
"totalProducts": 3
|
|
75
|
-
}
|
|
76
|
-
</script>
|
|
77
|
-
</body>
|
|
78
|
-
</html>
|
|
79
|
-
```
|
|
80
|
-
Load the page directly as shown before:
|
|
81
|
-
```python
|
|
82
|
-
from scrapling import Selector
|
|
83
|
-
page = Selector(html_doc)
|
|
84
|
-
```
|
|
85
|
-
Get all text content on the page recursively
|
|
86
|
-
```python
|
|
87
|
-
>>> page.get_all_text()
|
|
88
|
-
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
|
|
89
|
-
```
|
|
90
|
-
Get the first article (used as an example throughout):
|
|
91
|
-
```python
|
|
92
|
-
article = page.find('article')
|
|
93
|
-
```
|
|
94
|
-
With the same logic, get all text content on the element recursively
|
|
95
|
-
```python
|
|
96
|
-
>>> article.get_all_text()
|
|
97
|
-
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
|
|
98
|
-
```
|
|
99
|
-
But if you try to get the direct text content, it will be empty because it doesn't have direct text in the HTML code above
|
|
100
|
-
```python
|
|
101
|
-
>>> article.text
|
|
102
|
-
''
|
|
103
|
-
```
|
|
104
|
-
The `get_all_text` method has the following optional arguments:
|
|
105
|
-
|
|
106
|
-
1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'.
|
|
107
|
-
2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
|
|
108
|
-
3. **ignore_tags**: A tuple of tag names to exclude from the final results, along with any elements nested within them. The default is `('script', 'style',)`.
|
|
109
|
-
4. **valid_values**: If enabled, the method only collects elements with real values, so elements whose text content is empty or only whitespace are ignored. It's enabled by default.
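The four arguments above can be sketched with the standard library alone. The `collect_text` function below is a hypothetical helper written for this example on top of `xml.etree.ElementTree`; it mirrors the documented argument names and defaults but is not Scrapling's implementation, and it only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET


def collect_text(element, separator="\n", strip=False,
                 ignore_tags=("script", "style"), valid_values=True):
    """Recursively gather text from `element` and its descendants."""
    parts = []

    def emit(value):
        if value is None:
            return
        if strip:
            value = value.strip()
        if valid_values and not value.strip():
            return  # drop empty or whitespace-only strings
        parts.append(value)

    def visit(node):
        if node.tag in ignore_tags:
            return  # skip this tag and everything nested inside it
        emit(node.text)
        for child in node:
            visit(child)
            emit(child.tail)  # text after a child still belongs to `node`

    visit(element)
    return separator.join(parts)


fragment = ET.fromstring(
    "<article><h3>Product 1</h3><p>This is product 1</p>"
    "<script>var x = 1;</script><span> $10.99 </span></article>"
)
default_text = collect_text(fragment, strip=True)
with_script = collect_text(fragment, ignore_tags=())
```

Passing an empty `ignore_tags` brings the `script` content back into the result, which is the same trick the JSON example later in this document relies on.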
|
|
110
|
-
|
|
111
|
-
The text returned is a [TextHandler](#texthandler), not a standard string. If the text content can be serialized to JSON, use `.json()` on it:
|
|
112
|
-
```python
|
|
113
|
-
>>> script = page.find('script')
|
|
114
|
-
>>> script.json()
|
|
115
|
-
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
|
116
|
-
```
|
|
117
|
-
Let's continue to get the element tag
|
|
118
|
-
```python
|
|
119
|
-
>>> article.tag
|
|
120
|
-
'article'
|
|
121
|
-
```
|
|
122
|
-
Using it on the page directly operates on the root `html` element:
|
|
123
|
-
```python
|
|
124
|
-
>>> page.tag
|
|
125
|
-
'html'
|
|
126
|
-
```
|
|
127
|
-
Getting the attributes of the element
|
|
128
|
-
```python
|
|
129
|
-
>>> print(article.attrib)
|
|
130
|
-
{'class': 'product', 'data-id': '1'}
|
|
131
|
-
```
|
|
132
|
-
Access a specific attribute with any of the following
|
|
133
|
-
```python
|
|
134
|
-
>>> article.attrib['class']
|
|
135
|
-
>>> article.attrib.get('class')
|
|
136
|
-
>>> article['class'] # new in v0.3
|
|
137
|
-
```
|
|
138
|
-
Check if the attributes contain a specific attribute with any of the methods below
|
|
139
|
-
```python
|
|
140
|
-
>>> 'class' in article.attrib
|
|
141
|
-
>>> 'class' in article # new in v0.3
|
|
142
|
-
```
|
|
143
|
-
Get the HTML content of the element
|
|
144
|
-
```python
|
|
145
|
-
>>> article.html_content
|
|
146
|
-
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
|
|
147
|
-
```
|
|
148
|
-
Get the prettified version of the element's HTML content
|
|
149
|
-
```python
|
|
150
|
-
print(article.prettify())
|
|
151
|
-
```
|
|
152
|
-
```html
|
|
153
|
-
<article class="product" data-id="1"><h3>Product 1</h3>
|
|
154
|
-
<p class="description">This is product 1</p>
|
|
155
|
-
<span class="price">$10.99</span>
|
|
156
|
-
<div class="hidden stock">In stock: 5</div>
|
|
157
|
-
</article>
|
|
158
|
-
```
|
|
159
|
-
Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
|
|
160
|
-
```python
|
|
161
|
-
>>> page.body
|
|
162
|
-
'<html>\n <head>\n <title>Some page</title>\n </head>\n ...'
|
|
163
|
-
```
|
|
164
|
-
To get all the ancestors in the DOM tree of this element
|
|
165
|
-
```python
|
|
166
|
-
>>> article.path
|
|
167
|
-
[<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>,
|
|
168
|
-
<data='<body> <div class="product-list"> <artic...' parent='<html><head><title>Some page</title></he...'>,
|
|
169
|
-
<data='<html><head><title>Some page</title></he...'>]
|
|
170
|
-
```
|
|
171
|
-
Generate a CSS shortened selector if possible, or generate the full selector
|
|
172
|
-
```python
|
|
173
|
-
>>> article.generate_css_selector
|
|
174
|
-
'body > div > article'
|
|
175
|
-
>>> article.generate_full_css_selector
|
|
176
|
-
'body > div > article'
|
|
177
|
-
```
|
|
178
|
-
Same case with XPath
|
|
179
|
-
```python
|
|
180
|
-
>>> article.generate_xpath_selector
|
|
181
|
-
"//body/div/article"
|
|
182
|
-
>>> article.generate_full_xpath_selector
|
|
183
|
-
"//body/div/article"
|
|
184
|
-
```
|
|
185
|
-
|
|
186
|
-
### Traversal
|
|
187
|
-
Properties and methods for navigating elements on the page.
|
|
188
|
-
|
|
189
|
-
The `html` element is the root of the website's tree. Elements like `head` and `body` are "children" of `html`, and `html` is their "parent". The element `body` is a "sibling" of `head` and vice versa.
|
|
190
|
-
|
|
191
|
-
Accessing the parent of an element
|
|
192
|
-
```python
|
|
193
|
-
>>> article.parent
|
|
194
|
-
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
|
195
|
-
>>> article.parent.tag
|
|
196
|
-
'div'
|
|
197
|
-
```
|
|
198
|
-
Chaining is supported, as with all similar properties/methods:
|
|
199
|
-
```python
|
|
200
|
-
>>> article.parent.parent.tag
|
|
201
|
-
'body'
|
|
202
|
-
```
|
|
203
|
-
Get the children of an element
|
|
204
|
-
```python
|
|
205
|
-
>>> article.children
|
|
206
|
-
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
|
|
207
|
-
<data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
|
|
208
|
-
<data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
|
|
209
|
-
<data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
|
|
210
|
-
```
|
|
211
|
-
Get all elements underneath an element. It acts as a nested version of the `children` property
|
|
212
|
-
```python
|
|
213
|
-
>>> article.below_elements
|
|
214
|
-
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
|
|
215
|
-
<data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
|
|
216
|
-
<data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
|
|
217
|
-
<data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
|
|
218
|
-
```
|
|
219
|
-
This element returns the same result as the `children` property because its children don't have children.
|
|
220
|
-
|
|
221
|
-
Using the element with the `product-list` class makes the difference between the `children` property and the `below_elements` property clear:
|
|
222
|
-
```python
|
|
223
|
-
>>> products_list = page.css('.product-list')[0]
|
|
224
|
-
>>> products_list.children
|
|
225
|
-
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
226
|
-
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
227
|
-
<data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
|
|
228
|
-
|
|
229
|
-
>>> products_list.below_elements
|
|
230
|
-
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
231
|
-
<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
|
|
232
|
-
<data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
|
|
233
|
-
<data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
|
|
234
|
-
<data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>,
|
|
235
|
-
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
236
|
-
...]
|
|
237
|
-
```
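The same distinction between direct children and all nested elements can be reproduced with the standard library. This sketch uses `xml.etree.ElementTree` rather than Scrapling, and the sample markup is made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<div><article id='1'><h3>Product 1</h3><p>Text</p></article>"
    "<article id='2'><h3>Product 2</h3></article></div>"
)

# Direct children: only one level below the element.
direct_children = list(root)

# All descendants in document order, excluding the element itself.
all_descendants = [node for node in root.iter() if node is not root]

child_tags = [node.tag for node in direct_children]
descendant_tags = [node.tag for node in all_descendants]
```

Here `child_tags` holds only the two `article` elements, while `descendant_tags` also includes the headings and paragraph nested inside them.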
|
|
238
|
-
Get the siblings of an element
|
|
239
|
-
```python
|
|
240
|
-
>>> article.siblings
|
|
241
|
-
[<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
|
|
242
|
-
<data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
|
|
243
|
-
```
|
|
244
|
-
Get the next element of the current element
|
|
245
|
-
```python
|
|
246
|
-
>>> article.next
|
|
247
|
-
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
|
|
248
|
-
```
|
|
249
|
-
The same logic applies to the `previous` property
|
|
250
|
-
```python
|
|
251
|
-
>>> article.previous # It's the first child, so it doesn't have a previous element
|
|
252
|
-
>>> second_article = page.css('.product[data-id="2"]')[0]
|
|
253
|
-
>>> second_article.previous
|
|
254
|
-
<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
|
|
255
|
-
```
|
|
256
|
-
Check if an element has a specific class name:
|
|
257
|
-
```python
|
|
258
|
-
>>> article.has_class('product')
|
|
259
|
-
True
|
|
260
|
-
```
|
|
261
|
-
Iterate over the entire ancestors' tree of any element:
|
|
262
|
-
```python
|
|
263
|
-
for ancestor in article.iterancestors():
|
|
264
|
-
    print(ancestor.tag)  # do something with each ancestor
|
|
265
|
-
```
|
|
266
|
-
Search for a specific ancestor that satisfies a search function. Pass a function that takes a [Selector](#selector) object as an argument and returns `True`/`False`:
|
|
267
|
-
```python
|
|
268
|
-
>>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
|
|
269
|
-
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
|
270
|
-
|
|
271
|
-
>>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
|
|
272
|
-
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
|
|
273
|
-
```
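As a rough illustration of how an ancestor search with a predicate can work, the sketch below walks up a parent map built with `xml.etree.ElementTree`. The helper names `iter_ancestors` and `find_ancestor` are reused here only for familiarity; they are hypothetical standalone functions, not Scrapling's methods.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><body><div class='product-list'>"
    "<article class='product'><h3>Product 1</h3></article>"
    "</div></body></html>"
)

# ElementTree has no parent pointers, so build a child -> parent lookup once.
parents = {child: parent for parent in root.iter() for child in parent}


def iter_ancestors(node):
    """Yield the parent, grandparent, and so on up to the root."""
    while node in parents:
        node = parents[node]
        yield node


def find_ancestor(node, predicate):
    """Return the nearest ancestor accepted by `predicate`, or None."""
    return next((a for a in iter_ancestors(node) if predicate(a)), None)


article = root.find(".//article")
ancestor_tags = [a.tag for a in iter_ancestors(article)]
match = find_ancestor(article, lambda a: "product-list" in a.get("class", "").split())
missing = find_ancestor(article, lambda a: a.tag == "table")
```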
|
|
274
|
-
## Selectors
|
|
275
|
-
The `Selectors` class is the list version of the [Selector](#selector) class. It inherits from Python's built-in `list` type, so it shares all `list` properties and methods, and it adds methods that simplify operations on the [Selector](#selector) instances it contains.
|
|
276
|
-
|
|
277
|
-
In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
|
|
278
|
-
|
|
279
|
-
Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties return empty/default values gracefully.
|
|
280
|
-
|
|
281
|
-
```python
|
|
282
|
-
>>> page.css('a::text') # -> Selectors (of text node Selectors)
|
|
283
|
-
>>> page.xpath('//a/text()') # -> Selectors
|
|
284
|
-
>>> page.css('a::text').get() # -> TextHandler (the first text value)
|
|
285
|
-
>>> page.css('a::text').getall() # -> TextHandlers (all text values)
|
|
286
|
-
>>> page.css('a::attr(href)') # -> Selectors
|
|
287
|
-
>>> page.xpath('//a/@href') # -> Selectors
|
|
288
|
-
>>> page.css('.price_color') # -> Selectors
|
|
289
|
-
```
|
|
290
|
-
|
|
291
|
-
### Data extraction methods
|
|
292
|
-
Starting with v0.4, [Selector](#selector) and [Selectors](#selectors) both provide `get()`, `getall()`, and their aliases `extract_first` and `extract` (following Scrapy conventions). The old `get_all()` method has been removed.
|
|
293
|
-
|
|
294
|
-
**On a [Selector](#selector) object:**
|
|
295
|
-
|
|
296
|
-
- `get()` returns a `TextHandler`: for text node selectors, it returns the text value; for HTML element selectors, it returns the serialized outer HTML.
|
|
297
|
-
- `getall()` returns a `TextHandlers` list containing the single serialized string.
|
|
298
|
-
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
|
|
299
|
-
|
|
300
|
-
```python
|
|
301
|
-
>>> page.css('h3')[0].get() # Outer HTML of the element
|
|
302
|
-
'<h3>Product 1</h3>'
|
|
303
|
-
|
|
304
|
-
>>> page.css('h3::text')[0].get() # Text value of the text node
|
|
305
|
-
'Product 1'
|
|
306
|
-
```
|
|
307
|
-
|
|
308
|
-
**On a [Selectors](#selectors) object:**
|
|
309
|
-
|
|
310
|
-
- `get(default=None)` returns the serialized string of the **first** element, or `default` if the list is empty.
|
|
311
|
-
- `getall()` serializes **all** elements and returns a `TextHandlers` list.
|
|
312
|
-
- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
|
|
313
|
-
|
|
314
|
-
```python
|
|
315
|
-
>>> page.css('.price::text').get() # First price text
|
|
316
|
-
'$10.99'
|
|
317
|
-
|
|
318
|
-
>>> page.css('.price::text').getall() # All price texts
|
|
319
|
-
['$10.99', '$20.99', '$15.99']
|
|
320
|
-
|
|
321
|
-
>>> page.css('.price::text').get('') # With default value
|
|
322
|
-
'$10.99'
|
|
323
|
-
```
|
|
324
|
-
|
|
325
|
-
These methods work seamlessly with all selection types (CSS, XPath, `find`, etc.) and are the recommended way to extract text and attribute values in a Scrapy-compatible style.
|
|
326
|
-
|
|
327
|
-
### Properties
|
|
328
|
-
Apart from the standard operations on Python lists (iteration, slicing, etc.), the following operations are available:
|
|
329
|
-
|
|
330
|
-
CSS and XPath selectors can be executed directly on a [Selectors](#selectors) instance, with the same return types as [Selector](#selector)'s `css` and `xpath` methods. The arguments are similar, except the `adaptive` argument is not available. This makes chaining methods straightforward:
|
|
331
|
-
```python
|
|
332
|
-
>>> page.css('.product_pod a')
|
|
333
|
-
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
|
|
334
|
-
<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
|
|
335
|
-
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
|
|
336
|
-
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
|
|
337
|
-
<data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
|
|
338
|
-
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
|
|
339
|
-
...]
|
|
340
|
-
|
|
341
|
-
>>> page.css('.product_pod').css('a') # Returns the same result
|
|
342
|
-
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
|
|
343
|
-
<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
|
|
344
|
-
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
|
|
345
|
-
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
|
|
346
|
-
<data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
|
|
347
|
-
<data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
|
|
348
|
-
...]
|
|
349
|
-
```
|
|
350
|
-
The `re` and `re_first` methods can be run directly. They take the same arguments as the [Selector](#selector) class. In this class, `re_first` runs `re` on each [Selector](#selector) within and returns the first one with a result. The `re` method returns a [TextHandlers](#texthandlers) object combining all matches:
|
|
351
|
-
```python
|
|
352
|
-
>>> page.css('.price_color').re(r'[\d\.]+')
|
|
353
|
-
['51.77',
|
|
354
|
-
'53.74',
|
|
355
|
-
'50.10',
|
|
356
|
-
'47.82',
|
|
357
|
-
'54.23',
|
|
358
|
-
...]
|
|
359
|
-
|
|
360
|
-
>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
|
|
361
|
-
['a-light-in-the-attic_1000',
|
|
362
|
-
'tipping-the-velvet_999',
|
|
363
|
-
'soumission_998',
|
|
364
|
-
'sharp-objects_997',
|
|
365
|
-
...]
|
|
366
|
-
```
|
|
367
|
-
The `search` method searches the available [Selector](#selector) instances. The function passed must accept a [Selector](#selector) instance as the first argument and return True/False. Returns the first matching [Selector](#selector) instance, or `None`:
|
|
368
|
-
```python
|
|
369
|
-
# Find the first product with price 54.23.
|
|
370
|
-
>>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
|
|
371
|
-
>>> page.css('.product_pod').search(search_function)
|
|
372
|
-
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
|
|
373
|
-
```
|
|
374
|
-
The `filter` method takes a function like `search` but returns a `Selectors` instance of all matching [Selector](#selector) instances:
|
|
375
|
-
```python
|
|
376
|
-
# Find all products with prices over $50
|
|
377
|
-
>>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
|
|
378
|
-
>>> page.css('.product_pod').filter(filtering_function)
|
|
379
|
-
[<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
|
380
|
-
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
|
381
|
-
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
|
|
382
|
-
...]
|
|
383
|
-
```
|
|
384
|
-
Safe access to the first or last element without index errors:
|
|
385
|
-
```python
|
|
386
|
-
>>> page.css('.product').first # First Selector or None
|
|
387
|
-
<data='<article class="product" data-id="1"><h3...'>
|
|
388
|
-
>>> page.css('.product').last # Last Selector or None
|
|
389
|
-
<data='<article class="product" data-id="3"><h3...'>
|
|
390
|
-
>>> page.css('.nonexistent').first # Returns None instead of raising IndexError
|
|
391
|
-
```
|
|
392
|
-
|
|
393
|
-
Get the number of [Selector](#selector) instances in a [Selectors](#selectors) instance:
|
|
394
|
-
```python
|
|
395
|
-
page.css('.product_pod').length
|
|
396
|
-
```
|
|
397
|
-
which is equivalent to
|
|
398
|
-
```python
|
|
399
|
-
len(page.css('.product_pod'))
|
|
400
|
-
```
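The list helpers described in this section (`search`, `filter`, `first`, `last`, `length`) can be approximated with a small `list` subclass. `NodeList` is a hypothetical name used for this sketch, and plain floats stand in for [Selector](#selector) instances; Scrapling's real `Selectors` class may differ in detail.

```python
from typing import Callable, Optional


class NodeList(list):
    """Hypothetical list subclass with the convenience helpers described above."""

    def search(self, func: Callable) -> Optional[object]:
        # First item for which `func` is truthy, or None when nothing matches.
        return next((item for item in self if func(item)), None)

    def filter(self, func: Callable) -> "NodeList":
        # All matching items, wrapped in the same subclass so calls can chain.
        return NodeList(item for item in self if func(item))

    @property
    def first(self):
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None

    @property
    def length(self) -> int:
        return len(self)


prices = NodeList([51.77, 53.74, 50.10, 47.82, 54.23])
over_fifty = prices.filter(lambda p: p > 50)
exact = prices.search(lambda p: p == 54.23)
empty_first = NodeList().first
```

Returning the subclass from `filter` is what keeps further calls such as `.first` or `.length` available on the filtered result.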
|
|
401
|
-
|
|
402
|
-
## TextHandler
|
|
403
|
-
All methods/properties that return a string return `TextHandler`, and those that return a list of strings return [TextHandlers](#texthandlers) instead.
|
|
404
|
-
|
|
405
|
-
TextHandler is a subclass of the standard Python string, so all standard string operations are supported.
|
|
406
|
-
|
|
407
|
-
TextHandler provides extra methods and properties beyond standard Python strings. All methods and properties in all classes that return string(s) return TextHandler, enabling chaining and cleaner code. It can also be imported directly and used on any string.
|
|
408
|
-
### Usage
|
|
409
|
-
All operations (slicing, indexing, etc.) and methods (`split`, `replace`, `strip`, etc.) return a `TextHandler`, so they can be chained.
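A plain `str` subclass loses its type as soon as a built-in method is called, because `str` methods return ordinary strings. The sketch below shows one way to keep the subclass through a chain by re-wrapping results. `ChainableText` is a hypothetical class covering only a few methods; it is not Scrapling's `TextHandler` source.

```python
class ChainableText(str):
    """Hypothetical str subclass whose operations return the subclass again."""

    def strip(self, chars=None):
        return ChainableText(super().strip(chars))

    def replace(self, old, new, count=-1):
        return ChainableText(super().replace(old, new, count))

    def __getitem__(self, key):
        # Covers both indexing and slicing.
        return ChainableText(super().__getitem__(key))

    def sort(self, reverse=False):
        # Sort the characters, similar to the `.sort()` method described for TextHandler.
        return ChainableText("".join(sorted(self, reverse=reverse)))


price = ChainableText("  $10.99 ")
cleaned = price.strip().replace("$", "")[:2]
ordered = ChainableText("acb").sort()
```

Every step in the chain returns a `ChainableText`, so custom methods such as `sort` remain available at any point.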
|
|
410
|
-
|
|
411
|
-
The `re` and `re_first` methods exist in [Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers) as well, accepting the same arguments.
|
|
412
|
-
|
|
413
|
-
- The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments but returns only the first result as a `TextHandler` instance.
|
|
414
|
-
|
|
415
|
-
Also, it takes other helpful arguments, which are:
|
|
416
|
-
|
|
417
|
-
- **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
|
|
418
|
-
- **clean_match**: It's disabled by default. This causes the method to ignore all whitespace, including consecutive spaces, while matching.
|
|
419
|
-
- **case_sensitive**: It's enabled by default. As the name implies, disabling it causes the regex to ignore letter case during compilation.
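The three options above can be approximated with the standard `re` and `html` modules. The `find_all` function is a hypothetical helper for this sketch; in particular, treating `clean_match` as "collapse whitespace runs to single spaces before matching" is an assumption inferred from the examples in this document, not a statement about Scrapling's internals.

```python
import html
import re
from typing import List


def find_all(text: str, pattern: str, replace_entities: bool = True,
             clean_match: bool = False, case_sensitive: bool = True) -> List[str]:
    if clean_match:
        # Assumption: collapse every whitespace run to one space and trim the ends.
        text = " ".join(text.split())
    flags = 0 if case_sensitive else re.IGNORECASE
    results = []
    for match in re.finditer(pattern, text, flags):
        # Prefer the first capture group when the pattern defines one.
        value = match.group(1) if match.re.groups else match.group(0)
        if value is None:
            continue  # the group did not take part in this match
        results.append(html.unescape(value) if replace_entities else value)
    return results


no_match = find_all("hi  there", "hi there")
cleaned = find_all("hi  there", "hi there", clean_match=True)
relaxed = find_all("Oh, Hi Mark", "oh, hi mark", case_sensitive=False)
decoded = find_all("Tom &amp; Jerry", r"Tom .* Jerry")
```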
|
|
420
|
-
|
|
421
|
-
The return result is [TextHandlers](#texthandlers) because the `re` method is used:
|
|
422
|
-
```python
|
|
423
|
-
>>> page.css('.price_color').re(r'[\d\.]+')
|
|
424
|
-
['51.77',
|
|
425
|
-
'53.74',
|
|
426
|
-
'50.10',
|
|
427
|
-
'47.82',
|
|
428
|
-
'54.23',
|
|
429
|
-
...]
|
|
430
|
-
|
|
431
|
-
>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
|
|
432
|
-
['a-light-in-the-attic_1000',
|
|
433
|
-
'tipping-the-velvet_999',
|
|
434
|
-
'soumission_998',
|
|
435
|
-
'sharp-objects_997',
|
|
436
|
-
...]
|
|
437
|
-
```
|
|
438
|
-
Examples with custom strings demonstrating the other arguments:
|
|
439
|
-
```python
|
|
440
|
-
>>> from scrapling import TextHandler
|
|
441
|
-
>>> test_string = TextHandler('hi  there') # Note the two spaces
|
|
442
|
-
>>> test_string.re('hi there')
|
|
443
|
-
>>> test_string.re('hi there', clean_match=True) # Using `clean_match` will clean the string before matching the regex
|
|
444
|
-
['hi there']
|
|
445
|
-
|
|
446
|
-
>>> test_string2 = TextHandler('Oh, Hi Mark')
|
|
447
|
-
>>> test_string2.re_first('oh, hi Mark')
|
|
448
|
-
>>> test_string2.re_first('oh, hi Mark', case_sensitive=False) # Hence disabling `case_sensitive`
|
|
449
|
-
'Oh, Hi Mark'
|
|
450
|
-
|
|
451
|
-
# Mixing arguments
|
|
452
|
-
>>> test_string.re('hi there', clean_match=True, case_sensitive=False)
|
|
453
|
-
['hi there']
|
|
454
|
-
```
|
|
455
|
-
Since `html_content` returns `TextHandler`, regex can be applied directly on HTML content:
|
|
456
|
-
```python
|
|
457
|
-
>>> page.html_content.re('div class=".*">(.*)</div')
|
|
458
|
-
['In stock: 5', 'In stock: 3', 'Out of stock']
|
|
459
|
-
```
|
|
460
|
-
|
|
461
|
-
- The `.json()` method converts the content to a JSON object if possible; otherwise, it throws an error:
|
|
462
|
-
```python
|
|
463
|
-
>>> page.css('#page-data::text').get()
|
|
464
|
-
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
|
|
465
|
-
>>> page.css('#page-data::text').get().json()
|
|
466
|
-
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
|
467
|
-
```
|
|
468
|
-
If no text node is specified while selecting an element, the text content is selected automatically:
|
|
469
|
-
```python
|
|
470
|
-
>>> page.css('#page-data')[0].json()
|
|
471
|
-
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
|
472
|
-
```
|
|
473
|
-
The [Selector](#selector) class adds additional behavior. Given this page:
|
|
474
|
-
```html
|
|
475
|
-
<html>
|
|
476
|
-
<body>
|
|
477
|
-
<div>
|
|
478
|
-
<script id="page-data" type="application/json">
|
|
479
|
-
{
|
|
480
|
-
"lastUpdated": "2024-09-22T10:30:00Z",
|
|
481
|
-
"totalProducts": 3
|
|
482
|
-
}
|
|
483
|
-
</script>
|
|
484
|
-
</div>
|
|
485
|
-
</body>
|
|
486
|
-
</html>
|
|
487
|
-
```
|
|
488
|
-
The [Selector](#selector) class also has the `get_all_text` method, which returns a `TextHandler`. It helps when an element has no direct text of its own, as in this attempt:
|
|
489
|
-
```python
|
|
490
|
-
>>> page.css('div::text').get().json()
|
|
491
|
-
```
|
|
492
|
-
This throws an error because the `div` tag has no direct text content. The `get_all_text` method handles this case:
|
|
493
|
-
```python
|
|
494
|
-
>>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
|
|
495
|
-
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
|
|
496
|
-
```
|
|
497
|
-
The `ignore_tags` argument is used here because its default value is `('script', 'style',)`.
|
|
498
|
-
|
|
499
|
-
When dealing with a JSON response:
|
|
500
|
-
```python
|
|
501
|
-
>>> page = Selector("""{"some_key": "some_value"}""")
|
|
502
|
-
```
|
|
503
|
-
The [Selector](#selector) class is optimized for HTML, so it treats this as a broken HTML response and wraps it. The `html_content` property shows:
|
|
504
|
-
```python
|
|
505
|
-
>>> page.html_content
|
|
506
|
-
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
|
|
507
|
-
```
|
|
508
|
-
The `json` method can be used directly:
|
|
509
|
-
```python
|
|
510
|
-
>>> page.json()
|
|
511
|
-
{'some_key': 'some_value'}
|
|
512
|
-
```
|
|
513
|
-
For JSON responses, the [Selector](#selector) class keeps a raw copy of the content it receives. When `.json()` is called, it checks for that raw copy first and converts it to JSON. If the raw copy is unavailable (as with sub-elements), it checks the current element's text content, then falls back to `get_all_text`.
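The lookup order described above (raw copy first, then the element's own text, then the recursive text) can be sketched as a simple chain of candidates. `JsonCapableNode` and its three attributes are hypothetical names for this example and do not reflect Scrapling's internal structure.

```python
import json
from typing import Optional


class JsonCapableNode:
    """Hypothetical node illustrating the fallback order for `.json()`."""

    def __init__(self, raw_body: Optional[str], text: str, all_text: str):
        self.raw_body = raw_body  # raw copy, only kept for whole responses
        self.text = text          # direct text content of the element
        self.all_text = all_text  # text gathered recursively from descendants

    def json(self):
        # Try each source in order and parse the first non-empty one.
        for candidate in (self.raw_body, self.text, self.all_text):
            if candidate and candidate.strip():
                return json.loads(candidate)
        raise ValueError("no content available to parse as JSON")


response = JsonCapableNode('{"some_key": "some_value"}', "", "")
sub_element = JsonCapableNode(None, "", '{"totalProducts": 3}')
from_raw = response.json()
from_all_text = sub_element.json()
```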
|
|
514
|
-
|
|
515
|
-
- The `.clean()` method removes whitespace characters such as newlines and carriage returns and collapses consecutive spaces, returning a new `TextHandler` instance:
|
|
516
|
-
```python
|
|
517
|
-
>>> TextHandler('\n wonderful idea, \reh?').clean()
|
|
518
|
-
'wonderful idea, eh?'
|
|
519
|
-
```
|
|
520
|
-
The `remove_entities` argument causes `clean` to replace HTML entities with their corresponding characters.
|
|
521
|
-
|
|
522
|
-
- The `.sort()` method sorts the string characters:
|
|
523
|
-
```python
|
|
524
|
-
>>> TextHandler('acb').sort()
|
|
525
|
-
'abc'
|
|
526
|
-
```
|
|
527
|
-
Or do it in reverse:
```python
>>> TextHandler('acb').sort(reverse=True)
'cba'
```

This class is returned in place of strings nearly everywhere in the library.
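To make the `clean` and `sort` behavior above concrete, here is a rough approximation built only on the standard library. `CleanableText` is a hypothetical name, and the exact rules the library applies (for example, which whitespace characters it drops) are an assumption inferred from the examples above.

```python
import re
from html import unescape


class CleanableText(str):
    """Hypothetical str subclass; an approximation, not the library's TextHandler."""

    def clean(self, remove_entities=False):
        # Optionally turn HTML entities such as &amp; into their characters.
        text = unescape(self) if remove_entities else str(self)
        # Drop tabs, newlines, and carriage returns.
        text = re.sub(r"[\t\r\n]", "", text)
        # Collapse runs of spaces into one, then trim both ends.
        text = re.sub(r" +", " ", text).strip()
        return CleanableText(text)

    def sort(self, reverse=False):
        # Sort the characters and join them back into a string.
        return CleanableText("".join(sorted(self, reverse=reverse)))


cleaned = CleanableText('\n wonderful idea, \reh?').clean()
print(repr(cleaned))  # 'wonderful idea, eh?'
print(repr(CleanableText('fish &amp;  chips\n').clean(remove_entities=True)))  # 'fish & chips'
print(repr(CleanableText('acb').sort(reverse=True)))  # 'cba'
```

Returning a new instance of the subclass from each method is what allows calls to be chained, which matches the statement that `clean` returns a new `TextHandler`.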

## TextHandlers
This class inherits from Python's standard `list` and adds `re` and `re_first` as new methods.

The `re_first` method runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None` if nothing matches.
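The idea of a list subclass with `re` and `re_first` can be sketched as follows. `TextList` is a hypothetical name, and the method bodies are an assumption about the general approach rather than the library's code; the sketch only handles patterns without capture groups.

```python
import re


class TextList(list):
    """Hypothetical list subclass; not the library's TextHandlers."""

    def re(self, pattern):
        # Run the pattern over every string and flatten the matches.
        results = []
        for text in self:
            results.extend(re.findall(pattern, text))
        return results

    def re_first(self, pattern):
        # Return the first match found in any item, or None.
        for text in self:
            match = re.search(pattern, text)
            if match:
                return match.group(0)
        return None


prices = TextList(["Price: $10.99", "Price: $20.50", "Out of stock"])
print(prices.re(r"\d+\.\d+"))        # ['10.99', '20.50']
print(prices.re_first(r"\d+\.\d+"))  # 10.99
print(prices.re_first(r"[A-Z]{5}"))  # None
```

Because the class is still a `list`, indexing, slicing, and iteration keep working unchanged.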

## AttributesHandler
This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element/[Selector](#selector) instance.
```python
>>> print(page.find('script').attrib)
{'id': 'page-data', 'type': 'application/json'}
>>> type(page.find('script').attrib).__name__
'AttributesHandler'
```
Because it's read-only, it uses fewer resources than the standard dictionary. It still has the same methods and properties as a dictionary, except those that modify or override the data.
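One common way to get a dictionary-like object without mutation methods is to subclass `collections.abc.Mapping`. The sketch below shows that pattern; `ReadOnlyAttributes` is a hypothetical name, and whether the library is built this way is an assumption.

```python
from collections.abc import Mapping


class ReadOnlyAttributes(Mapping):
    """Hypothetical read-only mapping; not the library's AttributesHandler."""

    def __init__(self, data):
        self._data = dict(data)  # private copy of the attributes

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


attrib = ReadOnlyAttributes({"id": "page-data", "type": "application/json"})
print(attrib["id"])                  # page-data
print(attrib.get("missing", "n/a"))  # n/a
print(list(attrib.keys()))           # ['id', 'type']

try:
    attrib["id"] = "other"  # no __setitem__ is defined, so this fails
except TypeError as error:
    print("read-only:", error)
```

`Mapping` supplies `get`, `keys`, `values`, `items`, and `in` for free, so the read side behaves like a `dict` while assignment and deletion raise `TypeError`.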

It currently adds two simple extras:

- The `search_values` method

Searches the current attributes by value (rather than by key) and yields a dictionary for each matching item.

A simple example:
```python
>>> for i in page.find('script').attrib.search_values('page-data'):
...     print(i)
...
{'id': 'page-data'}
```
The method also accepts a `partial` argument, which lets you search by part of the value:
```python
>>> for i in page.find('script').attrib.search_values('page', partial=True):
...     print(i)
...
{'id': 'page-data'}
```
A more practical example is using it with `find_all` to find all elements that have a specific value in their attributes:
```python
>>> page.find_all(lambda element: list(element.attrib.search_values('product')))
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
<data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
All these elements have 'product' as the value for the `class` attribute.

The `list` function is needed here because `search_values` returns a generator, and a generator object is always truthy, so without `list` the lambda would be `True` for every element.

- The `json_string` property

This property converts current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error.

```python
>>> page.find('script').attrib.json_string
b'{"id":"page-data","type":"application/json"}'
```
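The two extras above can be imitated with a few lines of standard-library Python, which also makes the generator truthiness point easy to verify. The `search_values` function here is a hypothetical helper that only mirrors the described behavior, and `json.dumps` returns a `str`, whereas the `json_string` output shown above is `bytes`.

```python
import json


def search_values(attributes, keyword, partial=False):
    """Hypothetical helper mirroring the described behavior; yields one dict per match."""
    for key, value in attributes.items():
        matched = keyword in value if partial else keyword == value
        if matched:
            yield {key: value}


attributes = {"id": "page-data", "type": "application/json"}

print(list(search_values(attributes, "page-data")))           # [{'id': 'page-data'}]
print(list(search_values(attributes, "page", partial=True)))  # [{'id': 'page-data'}]

# A generator object is truthy even when it would yield nothing,
# which is why the find_all example wraps the call in list().
print(bool(search_values(attributes, "no-such-value")))        # True
print(bool(list(search_values(attributes, "no-such-value"))))  # False

# Compact JSON serialization of the attributes.
compact = json.dumps(attributes, separators=(",", ":"))
print(compact)  # {"id":"page-data","type":"application/json"}
```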
|