WebSearcher 0.7.0__tar.gz → 0.7.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- websearcher-0.7.2/.github/dependabot.yml +10 -0
- websearcher-0.7.2/.github/pull_request_template.md +5 -0
- websearcher-0.7.2/.github/workflows/publish.yml +33 -0
- websearcher-0.7.2/.github/workflows/test.yml +26 -0
- websearcher-0.7.2/.gitignore +22 -0
- websearcher-0.7.2/.pre-commit-config.yaml +16 -0
- websearcher-0.7.2/CHANGELOG.md +199 -0
- websearcher-0.7.2/CLAUDE.md +73 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/PKG-INFO +24 -142
- {websearcher-0.7.0 → websearcher-0.7.2}/README.md +20 -138
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/__init__.py +1 -1
- websearcher-0.7.2/WebSearcher/classifiers/__init__.py +7 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/classifiers/footer.py +1 -4
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/classifiers/main.py +108 -26
- websearcher-0.7.2/WebSearcher/component_parsers/__init__.py +107 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/ads.py +85 -102
- websearcher-0.7.2/WebSearcher/component_parsers/available_on.py +50 -0
- websearcher-0.7.2/WebSearcher/component_parsers/banner.py +39 -0
- websearcher-0.7.2/WebSearcher/component_parsers/discussions_and_forums.py +59 -0
- websearcher-0.7.2/WebSearcher/component_parsers/footer.py +36 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/general.py +236 -246
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/general_questions.py +17 -22
- websearcher-0.7.2/WebSearcher/component_parsers/images.py +105 -0
- websearcher-0.7.2/WebSearcher/component_parsers/knowledge.py +164 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/knowledge_rhs.py +46 -38
- websearcher-0.7.2/WebSearcher/component_parsers/latest_from.py +12 -0
- websearcher-0.7.2/WebSearcher/component_parsers/local_news.py +12 -0
- websearcher-0.7.2/WebSearcher/component_parsers/local_results.py +128 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/locations.py +16 -13
- websearcher-0.7.2/WebSearcher/component_parsers/map_results.py +21 -0
- websearcher-0.7.2/WebSearcher/component_parsers/news_quotes.py +55 -0
- websearcher-0.7.2/WebSearcher/component_parsers/notices.py +142 -0
- websearcher-0.7.2/WebSearcher/component_parsers/people_also_ask.py +34 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/perspectives.py +22 -24
- websearcher-0.7.2/WebSearcher/component_parsers/recent_posts.py +12 -0
- websearcher-0.7.2/WebSearcher/component_parsers/scholarly_articles.py +23 -0
- websearcher-0.7.2/WebSearcher/component_parsers/searches_related.py +60 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/shopping_ads.py +78 -84
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/short_videos.py +11 -11
- websearcher-0.7.2/WebSearcher/component_parsers/top_image_carousel.py +40 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/top_stories.py +101 -98
- websearcher-0.7.2/WebSearcher/component_parsers/twitter_cards.py +58 -0
- websearcher-0.7.2/WebSearcher/component_parsers/twitter_result.py +37 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/videos.py +28 -15
- websearcher-0.7.2/WebSearcher/component_parsers/view_more_news.py +50 -0
- websearcher-0.7.2/WebSearcher/component_types.py +442 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/components.py +2 -4
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_footer.py +2 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_main.py +86 -28
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/locations.py +36 -47
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/configs.py +4 -2
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/data.py +3 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/searches.py +4 -1
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/requests_searcher.py +6 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/selenium_searcher.py +29 -20
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/searchers.py +1 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/utils.py +56 -22
- websearcher-0.7.2/docs/README.md +8 -0
- websearcher-0.7.2/docs/guides/changelog.md +68 -0
- websearcher-0.7.2/docs/plans/000-component-parsers-update.md +514 -0
- websearcher-0.7.2/docs/plans/001-component-parser-details-field.md +404 -0
- websearcher-0.7.2/docs/plans/002-class-consolidation.md +324 -0
- websearcher-0.7.2/docs/plans/003-ad-parser-structure.md +50 -0
- websearcher-0.7.2/docs/plans/004-ads-vs-other-parsers.md +130 -0
- websearcher-0.7.2/docs/plans/005-parser-updates-v0.6.7a2.md +176 -0
- websearcher-0.7.2/docs/plans/006-formalize-get-title.md +352 -0
- websearcher-0.7.2/docs/plans/007-formalize-get-title-prompt.md +80 -0
- websearcher-0.7.2/docs/plans/008-ci-test-data.md +80 -0
- websearcher-0.7.2/docs/plans/009-refactor-feature-extractor.md +75 -0
- websearcher-0.7.2/docs/plans/010-things-to-know-classifier.md +40 -0
- websearcher-0.7.2/docs/plans/011-structured-data-in-html.md +80 -0
- websearcher-0.7.2/docs/plans/012-parsing-diagnostics.md +65 -0
- websearcher-0.7.2/docs/plans/013-dom-position-reorder.md +153 -0
- websearcher-0.7.2/docs/plans/014-bump-0.6.9.md +90 -0
- websearcher-0.7.2/docs/plans/015-js-driven-urls.md +33 -0
- websearcher-0.7.2/docs/plans/016-standardize-data-models.md +237 -0
- websearcher-0.7.2/docs/plans/017-parse-pipeline-optimization.md +490 -0
- websearcher-0.7.2/docs/plans/018-visible-flag-on-results.md +230 -0
- websearcher-0.7.2/docs/plans/019-video-details-from-evlb-cards.md +123 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/pyproject.toml +12 -8
- websearcher-0.7.2/scripts/show_parsed.py +116 -0
- websearcher-0.7.2/scripts/show_serp.py +127 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[01f85d1329ba].json +179 -64
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[032572e185d3].json +37 -17
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[0d3fc3b49b76].json +15 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[0ed311025efc].json +47 -7
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[130eba186e94].json +33 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[18eccfe8454e].json +29 -9
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[2c0aa0bbcd0c].json +51 -37
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[2d1b05a046b2].json +15 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[305b53af69be].json +125 -80
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[30c5d6bdb650].json +17 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[39617f527744].json +18 -6
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3c03a4a2cb7c].json +9 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3c09a0f0c92f].json +3 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3f5efb1dc358].json +19 -11
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[45b6e019bfa2].json +37 -17
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[4c8d8d2f226c].json +11 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[53940e35cc92].json +52 -13
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[56cbcf8cd4dc].json +50 -11
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[56f2eab63e9d].json +17 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[5898b04fb534].json +13 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6978d0cd767d].json +34 -18
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6aa70651b0cd].json +36 -12
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6e206db14899].json +13 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6e401e618433].json +64 -32
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7049404a2dd6].json +39 -23
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7333536d2911].json +56 -16
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7ad9715f3597].json +35 -7
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7b89c00120e3].json +61 -10
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7d76d3a83ebc].json +121 -76
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[811a27f92284].json +415 -156
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[83b17a6a7750].json +15 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8d1b75b71e7f].json +36 -12
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8e820f7b024f].json +75 -23
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8f98fa9c0bef].json +15 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9101d12ab778].json +38 -18
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[923a428c1c22].json +49 -9
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[97404b7b7c61].json +38 -18
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[984065877aad].json +35 -7
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9a7e39d95bf0].json +19 -11
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9ed1baa7715d].json +122 -56
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[a6c881e003e2].json +150 -26
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[a6c8fe7fe769].json +18 -6
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[aa594f199c3d].json +36 -12
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[b15c5131b06c].json +12 -4
- websearcher-0.7.2/tests/__snapshots__/test_parse_serp/test_parse_serp[b186024ec98a].json +413 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[b2e1777bf0f2].json +199 -153
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[be99c971b8f7].json +37 -17
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[c48f8aa3f6da].json +13 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[c9ab650f5bda].json +37 -17
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[cad43c3268a8].json +105 -41
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[ce37f114963e].json +35 -7
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d1855fa9cd1c].json +51 -7
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d1ac0c4abb10].json +21 -9
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d920789249af].json +20 -8
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[da9b4fce9ab0].json +33 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[dc5861b33dda].json +15 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[e71a1cb4cd70].json +122 -46
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[e828d00dc1b3].json +13 -5
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[eab14aa4ff5d].json +36 -8
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[f006c9318116].json +3 -3
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[f6fae1c9a96e].json +52 -20
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[faa9c7c889db].json +57 -25
- websearcher-0.7.2/tests/fixtures/serps-v0.7.2-knowledge-subcards.json.bz2 +0 -0
- websearcher-0.7.2/tests/test_component_types.py +96 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_utils.py +4 -6
- {websearcher-0.7.0 → websearcher-0.7.2}/uv.lock +189 -619
- websearcher-0.7.0/.github/workflows/publish.yml +0 -37
- websearcher-0.7.0/.github/workflows/test.yml +0 -32
- websearcher-0.7.0/.gitignore +0 -17
- websearcher-0.7.0/.pre-commit-config.yaml +0 -7
- websearcher-0.7.0/WebSearcher/classifiers/__init__.py +0 -11
- websearcher-0.7.0/WebSearcher/classifiers/header_components.py +0 -16
- websearcher-0.7.0/WebSearcher/classifiers/header_text.py +0 -163
- websearcher-0.7.0/WebSearcher/component_parsers/__init__.py +0 -104
- websearcher-0.7.0/WebSearcher/component_parsers/available_on.py +0 -37
- websearcher-0.7.0/WebSearcher/component_parsers/banner.py +0 -40
- websearcher-0.7.0/WebSearcher/component_parsers/discussions_and_forums.py +0 -60
- websearcher-0.7.0/WebSearcher/component_parsers/footer.py +0 -33
- websearcher-0.7.0/WebSearcher/component_parsers/images.py +0 -133
- websearcher-0.7.0/WebSearcher/component_parsers/knowledge.py +0 -149
- websearcher-0.7.0/WebSearcher/component_parsers/latest_from.py +0 -15
- websearcher-0.7.0/WebSearcher/component_parsers/local_news.py +0 -15
- websearcher-0.7.0/WebSearcher/component_parsers/local_results.py +0 -103
- websearcher-0.7.0/WebSearcher/component_parsers/map_results.py +0 -26
- websearcher-0.7.0/WebSearcher/component_parsers/news_quotes.py +0 -53
- websearcher-0.7.0/WebSearcher/component_parsers/notices.py +0 -173
- websearcher-0.7.0/WebSearcher/component_parsers/people_also_ask.py +0 -43
- websearcher-0.7.0/WebSearcher/component_parsers/recent_posts.py +0 -15
- websearcher-0.7.0/WebSearcher/component_parsers/scholarly_articles.py +0 -31
- websearcher-0.7.0/WebSearcher/component_parsers/searches_related.py +0 -53
- websearcher-0.7.0/WebSearcher/component_parsers/top_image_carousel.py +0 -38
- websearcher-0.7.0/WebSearcher/component_parsers/twitter_cards.py +0 -62
- websearcher-0.7.0/WebSearcher/component_parsers/twitter_result.py +0 -38
- websearcher-0.7.0/WebSearcher/component_parsers/view_more_news.py +0 -52
- websearcher-0.7.0/WebSearcher/models/cmpt_mappings.py +0 -193
- {websearcher-0.7.0 → websearcher-0.7.2}/.python-version +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/LICENSE +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/__init__.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_header.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_rhs.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/feature_extractor.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/logger.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/__init__.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/features.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/parsers.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/__init__.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/condense_fixtures.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_locations.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_parse.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_screenshot.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_search.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_search_headers.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_searches.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/scripts/parsed_to_csv.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[82e35954f552].json +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/fixtures/serps-v0.6.7.json.bz2 +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/fixtures/serps-v0.6.8.json.bz2 +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_feature_extractor.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_locations.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_models.py +0 -0
- {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_parse_serp.py +0 -0
@@ -0,0 +1,33 @@
+name: Publish
+
+on:
+  push:
+    tags: ["v*"]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - run: uv build
+      - uses: actions/upload-artifact@v7
+        with:
+          name: dist
+          path: dist/
+
+  publish:
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/project/WebSearcher/
+    permissions:
+      id-token: write
+      contents: read
+    steps:
+      - uses: actions/download-artifact@v8
+        with:
+          name: dist
+          path: dist/
+      - uses: pypa/gh-action-pypi-publish@release/v1
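
The `tags: ["v*"]` filter in the hunk above means only pushed tags whose names start with `v` start the Publish workflow. A rough sketch of that check follows, assuming `fnmatch`-style globbing is a close enough stand-in for GitHub's filter syntax; the helper name `tag_triggers_publish` is hypothetical and not part of the package or of GitHub Actions.

```python
from fnmatch import fnmatchcase

# Copied from the workflow's `on.push.tags` filter.
TAG_PATTERNS = ("v*",)


def tag_triggers_publish(tag: str, patterns: tuple[str, ...] = TAG_PATTERNS) -> bool:
    """Approximate whether pushing `tag` would start the Publish workflow.

    Assumption: fnmatch globbing stands in for GitHub's filter patterns.
    GitHub's `*` does not match "/", so tags containing "/" are rejected
    here; that shortcut is only valid for single-segment patterns like "v*".
    """
    if "/" in tag:
        return False
    return any(fnmatchcase(tag, pattern) for pattern in patterns)
```

Under these assumptions a tag such as `v0.7.2` passes the filter, while a bare `0.7.2` does not.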
@@ -0,0 +1,26 @@
+name: Tests
+
+on:
+  push:
+    branches: [dev, master]
+  pull_request:
+    branches: [dev, master]
+
+permissions:
+  contents: read
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.12", "3.13", "3.14"]
+    steps:
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v8.1.0
+      - run: uv python install ${{ matrix.python-version }}
+      - run: uv sync --all-groups --python ${{ matrix.python-version }}
+      - run: uv run ruff check .
+      - run: uv run ruff format --check .
+      - run: uv run pyrefly check
+      - run: uv run pytest --cov --cov-report=term-missing
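
The four `uv run` steps in the Tests workflow above can be mirrored locally before pushing. The following is a minimal sketch, not a script shipped with the package: the commands are copied from the workflow, while `CHECKS`, `run_checks`, and the injectable `run` parameter are illustrative names.

```python
import subprocess

# The lint, format, type-check, and test commands, in workflow order.
CHECKS = [
    ["uv", "run", "ruff", "check", "."],
    ["uv", "run", "ruff", "format", "--check", "."],
    ["uv", "run", "pyrefly", "check"],
    ["uv", "run", "pytest", "--cov", "--cov-report=term-missing"],
]


def run_checks(run=subprocess.run) -> bool:
    """Run each check in order and stop at the first failure.

    `run` is injectable so the function can be exercised without uv installed;
    it must return an object with a `returncode` attribute.
    """
    for cmd in CHECKS:
        result = run(cmd)
        if result.returncode != 0:
            print("failed:", " ".join(cmd))
            return False
    return True
```

Calling `run_checks()` from the repository root would execute the real commands; CI additionally repeats them for each Python version in the matrix, which this sketch does not attempt.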
@@ -0,0 +1,22 @@
+.venv/
+.archive/
+.claude
+TODO.md
+
+build/
+data/
+notebooks/
+scripts/ads-no-subtype/
+
+*.egg-info
+*__pycache__
+
+# Caches
+.pytest_cache/
+.ruff_cache/
+.coverage
+
+# Environment files
+.env
+.env.*
+!.env.example
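
The last three rules in the `.gitignore` hunk above rely on ordering: `.env.*` ignores every suffixed environment file, and the later `!.env.example` re-includes one of them. A simplified sketch of that "last matching rule wins" behavior follows, assuming plain filename globs only (no directory rules or path anchoring); `is_ignored` is a hypothetical helper, not Git's implementation.

```python
from fnmatch import fnmatchcase

# The environment-file rules from the .gitignore hunk, in file order.
RULES = [".env", ".env.*", "!.env.example"]


def is_ignored(filename: str, rules: list[str] = RULES) -> bool:
    """Apply rules top to bottom; the last matching rule decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatchcase(filename, pattern):
            # A negated rule re-includes the file; a plain rule ignores it.
            ignored = not negated
    return ignored
```

With these rules, `.env` and `.env.local` stay out of version control while `.env.example` can still be committed.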
@@ -0,0 +1,16 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.15.5
+    hooks:
+      - id: ruff-format
+      - id: ruff
+        args: [--fix]
+  - repo: local
+    hooks:
+      - id: pyrefly-check
+        name: pyrefly check
+        entry: uv run pyrefly check
+        language: system
+        types_or: [python, pyi]
+        pass_filenames: false
+        require_serial: true
@@ -0,0 +1,199 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/),
+and this project adheres to [Semantic Versioning](https://semver.org/).
+
+## [Unreleased]
+
+## [0.7.1] - 2026-05-03
+
+- Added component type registry: consolidates header-text mappings, parser dispatch, and labels (supersedes `cmpt_mappings.py`)
+- Added `find_by_selectors` utility; applied to classifiers
+- Added `Selector` NamedTuple in `utils` for tag/attrs selector lists
+- Added pyrefly type checking; fixed all type errors across the codebase
+- Added structural tests for component type registry; resolved "Ver más" collision
+- Refactored ad parser and classifier; tidied carousel scope; edited local parser/classifier selectors
+- Restored `knowledge_panel` cmpt-attr check on `jscontroller qTdDb`
+- Moved `ClassifyHeaderText` into `main.py` as `ClassifyMainHeader`; removed dead `ClassifyHeaderComponent`
+- Updated CI: ruff lint + format checks, pyrefly check, coverage; matrix narrowed to 3.12-3.14
+- Switched publish workflow to tag-based with split build + publish jobs
+- Added Dependabot config, PR template, pre-commit ruff + pyrefly hooks
+- Bumped support floor: `requires-python = ">=3.12"`, ruff/pyrefly `target-version = py312`
+- Bumped security floors: `requests>=2.33.0`, `lxml>=6.1.0`, `pytest>=9.0.3`; pulled Dependabot patches for `pygments`
+- Fixed `ResponseOutput` not subscriptable for dict-style access
+- Added `CHANGELOG.md` and changelog guide (migrated from README)
+- Added `docs/plans/` development history with landing page
+
+## [0.7.0] - 2026-03-15
+
+- **Breaking:** `details` field is now always `dict | None` with a self-describing `type` key (e.g. `{"type": "menu", "items": [...]}`)
+- **Breaking:** `parse_serp()` now always returns a dict with `results` and `features` keys; the `extract_features` parameter has been removed
+- Standardized all models on Pydantic BaseModel (removed dataclasses)
+- Added `ResponseOutput` and `ParsedSERP` typed models
+- Removed `DetailsItem`, `DetailsList` classes
+- Normalized `local_results` sub_type for location-specific headers
+- Replaced `os` with `pathlib.Path` throughout
+- Consolidated `webutils.py` into `utils.py`
+- Added ruff formatting, linting, and pre-commit hooks
+- Added test coverage reporting (69%)
+- Added unit tests for utils, locations, models, and feature extractor
+- Replaced pandas with polars in demo scripts
+
+## [0.6.9] - 2026-02-22
+
+- Fixed bugs in component parsers (class comparison, assignment operator, set literal)
+- Fixed `return` in `finally` block in requests searcher
+- Added captcha detection to feature extractor
+- Added captcha handling and jittered delay to demo searches
+- Dropped pandas from core dependencies
+- Cleaned up legacy typing imports
+- Removed poetry.toml
+
+## [0.6.8] - 2026-02-20
+
+- Migrated from Poetry to uv for dependency management
+- Added Python 3.12-3.14 test matrix in GitHub Actions
+- Added `flights` classifier and `standard-4` layout
+- Added local service ad parser
+- Extracted bottom ads before main column
+- Fixed `return` in `finally` block warning in selenium searcher
+
+## [0.6.7] - 2026-02-06
+
+- Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
+- Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
+- Added `sub_type` to perspectives parser from header text
+- Added CI test workflow on push to dev branch
+- Added compressed test fixtures with `condense_fixtures.py` script
+- Updated dependency lower bounds for security patches (protobuf, orjson)
+- Updated GitHub Actions to checkout v6 and setup-python v6
+
+## [0.6.6] - 2025-12-05
+
+- Update packages with dependabot alerts (brotli, urllib3)
+
+## [0.6.5] - 2025-12-05
+
+- Add GitHub Actions section to README
+
+## [0.6.0] - 2025-03-28
+
+- Method for collecting data with selenium; requests no longer works without a redirect
+- Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)
+
+## [0.5.2] - 2025-03-09
+
+- Added support for Spanish component headers by text
+- Pull request [#74](https://github.com/gitronald/WebSearcher/pull/74)
+
+## [0.5.1] - 2025-03-07
+
+- Fixed canonical name -> UULE converter using `protobuf`, see [this gist](https://gist.github.com/gitronald/66cac42194ea2d489ff3a1e32651e736) for details
+- Added lang arg to specify language in se.search, uses hl URL param and does not change Accept-Language request header (which defaults to en-US), but works in tests.
+- Fixed null location/language arg input handling (again)
+- Pull Request [#76](https://github.com/gitronald/WebSearcher/pull/76)
+
+## [0.5.0] - 2025-02-03
+
+- configuration now using poetry v2
+
+## [0.4.9] - 2025-02-03
+
+- last version with poetry v1, future versions (`>=0.5.0`) will use [poetry v2](https://python-poetry.org/blog/announcing-poetry-2.0.1/) configs.
+
+## [0.4.2] - [0.4.8] - 2024-11-11 to 2025-02-03
+
+- varied parser updates, testing with py3.12.
+
+## [0.4.1] - 2024-08-26
+
+- Added notices component types, including query edits, suggestions, language tips, and location tips.
+
+## [0.4.0] - 2024-05-27
+
+- Restructured parser for component classes, split classifier into submodules for header, main, footer, etc., and rewrote extractors to work with component classes. Various bug fixes.
+
+## [0.3.13]
+
+- New footer parser, broader extraction coverage, various bug and deprecation fixes.
+
+## [0.3.12] - 2024-05-09
+
+- Added num_results to search args, added handling for local results text and labels (made by the SE), ignore hidden_survey type at extraction.
+
+## [0.3.11] - 2024-05-08
+
+- Added extraction of labels for ads (made by the SE), use model validation, cleanup and various bug fixes.
+
+## [0.3.10] - 2024-05-06
+
+- Updated component classifier for images, added exportable header text mappings, added gist on localized searches.
+
+## [0.3.9] - 2024-02-25
+
+- Small fixes for video url parsing
+
+## [0.3.8] - 2024-02-13
+
+- Using SERP pydantic model, added github pip publishing workflow
+
+## [0.3.7] - 2024-02-09
+
+- Fixed localization, parser and classifier updates and fixes, image subtypes, changed rhs component handling.
+
+## [0.3.0] - [0.3.6] - 2023-10-16 to 2023-12-08
+
+- Parser updates for SERPs from 2022 and 2023, standalone extractors file, added pydantic, reduced redundancies in outputs.
+
+## [2020.0.0], [2022.12.18], [2023.01.04] - 2022-12-19, 2022-12-21, and 2023-01-04
+
+- Various updates, attempt at date versioning that seemed like a good idea at the time ¯\\\_(ツ)\_/¯
+
+<!-- refs/tags/v2022.12.18 -->
+<!-- refs/tags/v2023.01.04 -->
+
+## [0.2.15]
+
+- Fix people-also-ask and hotel false positives, add flag for left-hand side bar
+
+## [0.2.14]
+
+- Add shopping ads carousel and three knowledge subtypes (flights, hotels, events)
+
+## [0.2.13]
+
+- Small fixes for knowledge subtypes, general subtypes, and ads
+
+## [0.2.12] - 2021-12-17
+
+- Try to brotli decompress by default
+
+## [0.2.11] - 2021-11-15
+
+- Fixed local result parser and no return in general extra details
+
+## [0.2.10]
+
+- a) Add right-hand-side knowledge panel and top image carousel, b) Add knowledge and general component subtypes, c) Updates to component classifier, footer, ad, and people_also_ask components
+
+## [0.2.9] - 2021-05-08
+
+- Various fixes for SERPs with a left-hand side bar, which are becoming more common and change other parts of the SERP layout.
+
+## [0.2.8] - 2021-03-08
+
+- Small fixes due to HTML changes, such as missing titles and URLs in general components
+
+## [0.2.7] - 2020-11-30
+
+- Added fix for parsing twitter cards, removed pandas dependencies and several unused functions, moving towards greater package simplicity.
+
+## [0.2.6] - 2020-11-14
+
+- Updated ad parser for latest format, still handles older ad format.
+
+## [0.2.5] - 2020-07-24
+
+- Google Search, like most online platforms, undergoes changes over time. These changes often affect not just their outward appearance, but the underlying code that parsers depend on. This makes parsing a goal with a moving target. Sometime around February 2020, Google changed a few elements of their HTML structure which broke this parser. I created this patch for these changes, but have not tested its backwards compatibility (e.g. on SERPs collected prior to 2/2020). More generally, there's no guarantee on future compatibility. In fact, there is almost certainly the opposite: more changes will inevitably occur. If you have older data that you need to parse and the current parser doesn't work, you can try using `0.2.1`, or send a pull request if you find a way to make both work!
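
The 0.7.0 entry in the changelog above states that `details` is now always `dict | None` and carries a self-describing `type` key, with `{"type": "menu", "items": [...]}` given as the example. Downstream code can therefore branch on `type` instead of inspecting the value's shape. The sketch below illustrates that idea; `summarize_details` is a hypothetical helper, and treating `items` as an optional list for every type is an assumption that the changelog confirms only for the `menu` example.

```python
from __future__ import annotations


def summarize_details(details: dict | None) -> str:
    """Return a short label for a result's `details` value.

    Assumes only what the changelog states: `details` is a dict or None,
    and a dict carries a `type` key. Any other key is treated as optional.
    """
    if details is None:
        return "-"
    kind = details.get("type", "unknown")
    items = details.get("items")
    if isinstance(items, list):
        return f"{kind} ({len(items)} items)"
    return str(kind)
```

For instance, `summarize_details({"type": "menu", "items": ["a", "b"]})` yields `"menu (2 items)"`, and `None` maps to `"-"`.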
@@ -0,0 +1,73 @@
|
|
|
1
|
+
# WebSearcher
|
|
2
|
+
|
|
3
|
+
Parser library for Google search result pages. Each SERP is decomposed into a typed list of `results` (e.g. `general`, `top_stories`, `perspectives`, `local_results`, `available_on`, `videos`, `searches_related`, ...) by an `Extractor` + `ClassifyMain`/`ClassifyFooter` pipeline, then per-component parsers populate `title`, `url`, `text`, `cite`, `details`, etc.
|
|
4
|
+
|
|
5
|
+
## Running code
|
|
6
|
+
|
|
7
|
+
All Python execution goes through `uv` (the project's `.python-version` is 3.14):
|
|
8
|
+
|
|
9
|
+
```bash
|
|
10
|
+
uv run python scripts/...
|
|
11
|
+
uv run pytest
|
|
12
|
+
```
|
|
13
|
+
|
|
14
|
+
Do not use `poetry`, `python`, or `python3` directly. Some skill files still reference `poetry` from before the migration — treat those as stale.
|
|
15
|
+
|
|
16
|
+
## Inspection scripts
|
|
17
|
+
|
|
18
|
+
Use these when diagnosing parser issues or auditing a demo dataset:
|
|
19
|
+
|
|
20
|
+
| Script | When |
|
|
21
|
+
|---|---|
|
|
22
|
+
| `scripts/show_parsed.py "{query}" --data-dir {dir}` | See the parsed-results table for a SERP (rank, type, sub_type, title, url, details summary). Re-parses fresh on every call — always reflects current code. |
|
|
23
|
+
| `scripts/show_serp.py "{query}" --data-dir {dir}` | Serve the saved HTML on localhost:8765 with Google's overlays/scroll-locks stripped. `--raw` keeps the original. |
|
|
24
|
+
| `scripts/demo_screenshot.py --data-dir {dir}` | Render a screenshot of the SERP with component-type colored borders. |
|
|
25
|
+
| `scripts/show_serp.py --list --data-dir {dir}` | List the queries in a dataset. |
|
|
26
|
+
|
|
27
|
+
Default `--data-dir` resolves to `data/demo-ws-v{ws.__version__}/`. Pass an explicit path for older fixtures or `/tmp` extracts.
|
|
28
|
+
|
|
29
|
+
## Demo data layout
|
|
30
|
+
|
|
31
|
+
```
|
|
32
|
+
data/demo-ws-v{version}/
|
|
33
|
+
serps.json # JSONL: {qry, html, serp_id, ...}
|
|
34
|
+
parsed.json # JSONL: {results: [...], features: {...}, ...} — pre-parsed output
|
|
35
|
+
searches.json # JSONL: {qry, serp_id, timestamp, ...} — search metadata
|
|
36
|
+
tests/fixtures/
|
|
37
|
+
serps-v0.6.7.json.bz2 # older bz2-compressed JSONL fixtures
|
|
38
|
+
serps-v0.6.8.json.bz2 # — useful for regressions; the "northern lights" SERP used in plan 018/019 lives here
|
|
39
|
+
```
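The layout above describes both plain JSONL files and bz2-compressed JSONL fixtures. A minimal sketch of reading either kind with the standard library follows. The helper name `iter_jsonl` is hypothetical and is not part of WebSearcher; the only assumption about the data is that each non-empty line is one JSON object, as the layout comments suggest.

```python
import bz2
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file (plain or .bz2).

    Hypothetical helper for illustration; not a WebSearcher function.
    """
    # Pick the opener from the file suffix: bz2.open for *.bz2, built-in open otherwise.
    opener = bz2.open if path.suffix == ".bz2" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Something like `next(iter_jsonl(Path("tests/fixtures/serps-v0.6.8.json.bz2")))` would then return the first record of that fixture, assuming the file exists at that path.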
|
|
40
|
+
|
|
41
|
+
## Skills (`.claude/skills/`)
|
|
42
|
+
|
|
43
|
+
| Skill | Use case |
|
|
44
|
+
|---|---|
|
|
45
|
+
| `/parser-update` | 7-phase parser diagnose-and-fix workflow on demo data. Now references the inspection scripts above. |
|
|
46
|
+
| `/compare-parsed` | Compare parsed output before/after uncommitted changes — regression check. |
|
|
47
|
+
| `/compare-selectors` | Diff selectors between current and committed versions of a parser file. |
|
|
48
|
+
| `/reparse` | Re-parse fixtures with current code and diff against pre-parsed output. |
|
|
49
|
+
|
|
50
|
+
## Plans and TODO
|
|
51
|
+
|
|
52
|
+
- `docs/plans/{NNN}-{slug}.md` — design specs with `status: draft|active|done|abandoned` frontmatter
|
|
53
|
+
- `TODO.md` — gitignored, flat list of open/closed items, each pointing to a plan or marked `(no plan)`
|
|
54
|
+
|
|
55
|
+
Conventions per global rules in `~/.claude/rules/plan-files.md` and `~/.claude/rules/todo-files.md`.
|
|
56
|
+
|
|
57
|
+
## Schema conventions
|
|
58
|
+
|
|
59
|
+
A `details` block on a parsed result uses a controlled vocabulary of `type` values. Try to reuse existing labels rather than invent new ones:
|
|
60
|
+
|
|
61
|
+
- `text` — `{items: [str, ...]}` (people_also_ask, searches_related)
|
|
62
|
+
- `hyperlinks` — `{items: [{url, text, ...}]}` (available_on, top_image_carousel, knowledge_rhs, footer image cards, general submenu)
|
|
63
|
+
- `ratings` — rating data, see general's modern rating widget and shopping_ads
|
|
64
|
+
- `place` — local_results: rating, n_reviews, price, category, address, phone, hours, review_snippet, website, directions
|
|
65
|
+
- `panel`, `video` — used by knowledge / general video subtypes
|
|
66
|
+
|
|
67
|
+
When a `details` block would have only null fields, set `details=None` instead of emitting a hollow dict.
|
|
68
|
+
|
|
69
|
+
## Notable parser dependencies
|
|
70
|
+
|
|
71
|
+
- `top_stories.parse_top_stories` is shared by `top_stories`, `perspectives`, `local_news`, `recent_posts`, and `latest_from`. One selector change ripples to all five. Verify with `/compare-parsed` after edits.
|
|
72
|
+
- `parse_subtype_details` in `general.py` has its own elif chain for layout variants (subresult, submenu, submenu_rating, submenu_mini, submenu_scholarly, submenu_product). Some branches are dormant on modern SERPs; modern rating widget reads from `Y0A0hc`/`z3HNkc` aria-labels.
|
|
73
|
+
- `classifiers/main.py` runs an ordered chain of classifiers; order matters when components match multiple. `available_on` was recently moved ahead of `knowledge_panel` because the streaming-providers widget was being absorbed.
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: WebSearcher
|
|
3
|
-
Version: 0.7.
|
|
3
|
+
Version: 0.7.2
|
|
4
4
|
Summary: Tools for conducting, collecting, and parsing web search
|
|
5
5
|
Project-URL: homepage, http://github.com/gitronald/WebSearcher
|
|
6
6
|
Project-URL: repository, http://github.com/gitronald/WebSearcher
|
|
@@ -8,14 +8,14 @@ Author-email: "Ronald E. Robertson" <rer@acm.org>
|
|
|
8
8
|
License-Expression: GPL-3.0
|
|
9
9
|
License-File: LICENSE
|
|
10
10
|
Keywords: parser,search,web
|
|
11
|
-
Requires-Python: >=3.
|
|
11
|
+
Requires-Python: >=3.12
|
|
12
12
|
Requires-Dist: beautifulsoup4>=4.12.3
|
|
13
13
|
Requires-Dist: brotli>=1.1.0
|
|
14
|
-
Requires-Dist: lxml>=
|
|
14
|
+
Requires-Dist: lxml>=6.1.0
|
|
15
15
|
Requires-Dist: orjson<4.0.0,>=3.11.5
|
|
16
16
|
Requires-Dist: protobuf<7.0.0,>=6.33.5
|
|
17
17
|
Requires-Dist: pydantic>=2.9.2
|
|
18
|
-
Requires-Dist: requests>=2.
|
|
18
|
+
Requires-Dist: requests>=2.33.0
|
|
19
19
|
Requires-Dist: selenium>=4.9.0
|
|
20
20
|
Requires-Dist: tldextract>=5.1.2
|
|
21
21
|
Requires-Dist: undetected-chromedriver>=3.5.5
|
|
@@ -29,41 +29,40 @@ This package provides tools for conducting algorithm audits of web search and
|
|
|
29
29
|
includes a scraper built on `selenium` with tools for geolocating, conducting,
|
|
30
30
|
and saving searches. It also includes a modular parser built on `BeautifulSoup`
|
|
31
31
|
for decomposing a SERP into a list of components with categorical classifications
|
|
32
|
-
and position-based specifications.
|
|
32
|
+
and position-based specifications.
|
|
33
|
+
|
|
34
|
+
## Recent Changes
|
|
35
|
+
|
|
36
|
+
- `0.7.1`: Added component type registry and pyrefly type checking; refreshed CI/tooling (lint, format, type-check, tag-based publish); bumped Python floor to 3.12
|
|
37
|
+
- `0.7.0`: Breaking changes, standardized data models on Pydantic, typed `details` field, and removed `DetailsItem`/`DetailsList`
|
|
38
|
+
|
|
39
|
+
See [CHANGELOG.md](CHANGELOG.md) for a longer history of changes by version.
|
|
33
40
|
|
|
34
41
|
## Table of Contents
|
|
35
42
|
|
|
36
43
|
- [WebSearcher](#websearcher)
|
|
44
|
+
- [Tools for conducting and parsing web searches](#tools-for-conducting-and-parsing-web-searches)
|
|
45
|
+
- [Recent Changes](#recent-changes)
|
|
46
|
+
- [Table of Contents](#table-of-contents)
|
|
37
47
|
- [Getting Started](#getting-started)
|
|
38
48
|
- [Usage](#usage)
|
|
39
49
|
- [Example Search Script](#example-search-script)
|
|
40
50
|
- [Step by Step](#step-by-step)
|
|
51
|
+
- [1. Initialize Collector](#1-initialize-collector)
|
|
52
|
+
- [2. Conduct a Search](#2-conduct-a-search)
|
|
53
|
+
- [3. Parse Search Results](#3-parse-search-results)
|
|
54
|
+
- [4. Save HTML and Metadata](#4-save-html-and-metadata)
|
|
55
|
+
- [5. Save Parsed Results](#5-save-parsed-results)
|
|
41
56
|
- [Localization](#localization)
|
|
42
57
|
- [Contributing](#contributing)
|
|
58
|
+
- [Repair or Enhance a Parser](#repair-or-enhance-a-parser)
|
|
59
|
+
- [Add a Parser](#add-a-parser)
|
|
60
|
+
- [Testing](#testing)
|
|
61
|
+
- [Test Fixtures](#test-fixtures)
|
|
43
62
|
- [GitHub Actions](#github-actions)
|
|
44
|
-
- [Recent Updates](#recent-updates)
|
|
45
|
-
- [Update Log](#update-log)
|
|
46
63
|
- [Similar Packages](#similar-packages)
|
|
47
64
|
- [License](#license)
|
|
48
65
|
|
|
49
|
-
---
|
|
50
|
-
## Recent Updates
|
|
51
|
-
|
|
52
|
-
### 0.7.0 (dev)
|
|
53
|
-
|
|
54
|
-
- **Breaking:** `details` field is now always `dict | None` with a self-describing `type` key (e.g. `{"type": "menu", "items": [...]}`)
|
|
55
|
-
- **Breaking:** `parse_serp()` now always returns a dict with `results` and `features` keys; the `extract_features` parameter has been removed
|
|
56
|
-
- Standardized all models on Pydantic BaseModel (removed dataclasses)
|
|
57
|
-
- Added `ResponseOutput` and `ParsedSERP` typed models
|
|
58
|
-
- Removed `DetailsItem`, `DetailsList` classes
|
|
59
|
-
- Normalized `local_results` sub_type for location-specific headers
|
|
60
|
-
- Replaced `os` with `pathlib.Path` throughout
|
|
61
|
-
- Consolidated `webutils.py` into `utils.py`
|
|
62
|
-
- Added ruff formatting, linting, and pre-commit hooks
|
|
63
|
-
- Added test coverage reporting (69%)
|
|
64
|
-
- Added unit tests for utils, locations, models, and feature extractor
|
|
65
|
-
- Replaced pandas with polars in demo scripts
|
|
66
|
-
|
|
67
66
|
---
|
|
68
67
|
## Getting Started
|
|
69
68
|
|
|
@@ -295,123 +294,6 @@ To release a new version:
|
|
|
295
294
|
2. Once merged, the package is automatically published to PyPI
|
|
296
295
|
|
|
297
296
|
---
|
|
298
|
-
## Update Log
|
|
299
|
-
|
|
300
|
-
`0.7.0`
|
|
301
|
-
- Standardize data models on Pydantic, typed details field, remove DetailsItem/DetailsList
|
|
302
|
-
|
|
303
|
-
`0.6.9`
|
|
304
|
-
- Fixed bugs in component parsers (class comparison, assignment operator, set literal)
|
|
305
|
-
- Fixed `return` in `finally` block in requests searcher
|
|
306
|
-
- Added captcha detection to feature extractor
|
|
307
|
-
- Added captcha handling and jittered delay to demo searches
|
|
308
|
-
- Dropped pandas from core dependencies
|
|
309
|
-
- Cleaned up legacy typing imports
|
|
310
|
-
- Removed poetry.toml
|
|
311
|
-
|
|
312
|
-
`0.6.8`
|
|
313
|
-
- Migrated from Poetry to uv for dependency management
|
|
314
|
-
- Added Python 3.12-3.14 test matrix in GitHub Actions
|
|
315
|
-
- Added `flights` classifier and `standard-4` layout
|
|
316
|
-
- Added local service ad parser
|
|
317
|
-
- Extracted bottom ads before main column
|
|
318
|
-
- Fixed `return` in `finally` block warning in selenium searcher
|
|
319
|
-
|
|
320
|
-
`0.6.7`
|
|
321
|
-
- Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
|
|
322
|
-
- Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
|
|
323
|
-
- Added `sub_type` to perspectives parser from header text
|
|
324
|
-
- Added CI test workflow on push to dev branch
|
|
325
|
-
- Added compressed test fixtures with `condense_fixtures.py` script
|
|
326
|
-
- Updated dependency lower bounds for security patches (protobuf, orjson)
|
|
327
|
-
- Updated GitHub Actions to checkout v6 and setup-python v6
|
|
328
|
-
|
|
329
|
-
`0.6.6`
|
|
330
|
-
- Update packages with dependabot alerts (brotli, urllib3)
|
|
331
|
-
|
|
332
|
-
`0.6.5`
|
|
333
|
-
- Add GitHub Actions section to README
|
|
334
|
-
|
|
335
|
-
`0.6.0`
|
|
336
|
-
- Method for collecting data with selenium; requests no longer works without a redirect
|
|
337
|
-
- Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)
|
|
338
|
-
|
|
339
|
-
`0.5.2`
|
|
340
|
-
- Added support for Spanish component headers by text
|
|
341
|
-
- Pull request [#74](https://github.com/gitronald/WebSearcher/pull/74)
|
|
342
|
-
|
|
343
|
-
`0.5.1`
|
|
344
|
-
- Fixed canonical name -> UULE converter using `protobuf`, see [this gist](https://gist.github.com/gitronald/66cac42194ea2d489ff3a1e32651e736) for details
|
|
345
|
-
- Added lang arg to specify language in se.search, uses hl URL param and does not change Accept-Language request header (which defaults to en-US), but works in tests.
|
|
346
|
-
- Fixed null location/language arg input handling (again)
|
|
347
|
-
- Pull Request [#76](https://github.com/gitronald/WebSearcher/pull/76)
|
|
348
|
-
|
|
349
|
-
`0.5.0`
|
|
350
|
-
- configuration now using poetry v2
|
|
351
|
-
|
|
352
|
-
`0.4.9` - last version with poetry v1, future versions (`>=0.5.0`) will use [poetry v2](https://python-poetry.org/blog/announcing-poetry-2.0.1/) configs.
|
|
353
|
-
|
|
354
|
-
`0.4.2` - `0.4.8` - varied parser updates, testing with py3.12.
|
|
355
|
-
|
|
356
|
-
`0.4.1` - Added notices component types, including query edits, suggestions, language tips, and location tips.
|
|
357
|
-
|
|
358
|
-
`0.4.0` - Restructured parser for component classes, split classifier into submodules for header, main, footer, etc., and rewrote extractors to work with component classes. Various bug fixes.
|
|
359
|
-
|
|
360
|
-
`0.3.13` - New footer parser, broader extraction coverage, various bug and deprecation fixes.
|
|
361
|
-
|
|
362
|
-
`0.3.12` - Added num_results to search args, added handling for local results text and labels (made by the SE), ignore hidden_survey type at extraction.
|
|
363
|
-
|
|
364
|
-
`0.3.11` - Added extraction of labels for ads (made by the SE), use model validation, cleanup and various bug fixes.
|
|
365
|
-
|
|
366
|
-
`0.3.10` - Updated component classifier for images, added exportable header text mappings, added gist on localized searches.
|
|
367
|
-
|
|
368
|
-
`0.3.9` - Small fixes for video url parsing
|
|
369
|
-
|
|
370
|
-
`0.3.8` - Using SERP pydantic model, added github pip publishing workflow
|
|
371
|
-
|
|
372
|
-
`0.3.7` - Fixed localization, parser and classifier updates and fixes, image subtypes, changed rhs component handling.
|
|
373
|
-
|
|
374
|
-
`0.3.0` - `0.3.6` - Parser updates for SERPs from 2022 and 2023, standalone extractors file, added pydantic, reduced redundancies in outputs.
|
|
375
|
-
|
|
376
|
-
`2020.0.0`, `2022.12.18`, `2023.01.04` - Various updates, attempt at date versioning that seemed like a good idea at the time ¯\\\_(ツ)\_/¯
|
|
377
|
-
|
|
378
|
-
<!-- refs/tags/v2022.12.18 -->
|
|
379
|
-
<!-- refs/tags/v2023.01.04 -->
|
|
380
|
-
|
|
381
|
-
`0.2.15` - Fix people-also-ask and hotel false positives, add flag for left-hand side bar
|
|
382
|
-
|
|
383
|
-
`0.2.14` - Add shopping ads carousel and three knowledge subtypes (flights, hotels, events)
|
|
384
|
-
|
|
385
|
-
`0.2.13` - Small fixes for knowledge subtypes, general subtypes, and ads
|
|
386
|
-
|
|
387
|
-
`0.2.12` - Try to brotli decompress by default
|
|
388
|
-
|
|
389
|
-
`0.2.11` - Fixed local result parser and no return in general extra details
|
|
390
|
-
|
|
391
|
-
`0.2.10` - a) Add right-hand-side knowledge panel and top image carousel, b) Add knowledge and general component subtypes, c) Updates to component classifier, footer, ad, and people_also_ask components
|
|
392
|
-
|
|
393
|
-
`0.2.9` - Various fixes for SERPs with a left-hand side bar, which are becoming more common and change other parts of the SERP layout.
|
|
394
|
-
|
|
395
|
-
`0.2.8` - Small fixes due to HTML changes, such as missing titles and URLs in general components
|
|
396
|
-
|
|
397
|
-
`0.2.7` - Added fix for parsing twitter cards, removed pandas dependencies and
|
|
398
|
-
several unused functions, moving towards greater package simplicity.
|
|
399
|
-
|
|
400
|
-
`0.2.6` - Updated ad parser for latest format, still handles older ad format.
|
|
401
|
-
|
|
402
|
-
`0.2.5` - Google Search, like most online platforms, undergoes changes over time.
|
|
403
|
-
These changes often affect not just their outward appearance, but the underlying
|
|
404
|
-
code that parsers depend on. This makes parsing a goal with a moving target.
|
|
405
|
-
Sometime around February 2020, Google changed a few elements of their HTML
|
|
406
|
-
structure which broke this parser. I created this patch for these changes,
|
|
407
|
-
but have not tested its backwards compatibility (e.g. on SERPs collected prior to
|
|
408
|
-
2/2020). More generally, there's no guarantee on future compatibility. In fact,
|
|
409
|
-
there is almost certainly the opposite: more changes will inevitably occur.
|
|
410
|
-
If you have older data that you need to parse and the current parser doesn't work,
|
|
411
|
-
you can try using `0.2.1`, or send a pull request if you find a way to make both work!
|
|
412
|
-
|
|
413
|
-
|
|
414
|
-
---
|
|
415
297
|
## Similar Packages
|
|
416
298
|
|
|
417
299
|
Many of the packages I've found for collecting web search data via python are no longer maintained, but others are still ongoing and interesting or useful. The primary strength of WebSearcher is its parser, which provides a level of detail that enables examinations of SERP [composition](http://dl.acm.org/citation.cfm?doid=3178876.3186143) by recording the type and position of each result, and its modular design, which has allowed us to (intermittently) maintain it for so long and to cover such a wide array of component types (currently 25 without considering `sub_types`). Feel free to add to the list of packages or services through a pull request if you are aware of others:
|