WebSearcher 0.7.0__tar.gz → 0.7.2__tar.gz

Files changed (203)
  1. websearcher-0.7.2/.github/dependabot.yml +10 -0
  2. websearcher-0.7.2/.github/pull_request_template.md +5 -0
  3. websearcher-0.7.2/.github/workflows/publish.yml +33 -0
  4. websearcher-0.7.2/.github/workflows/test.yml +26 -0
  5. websearcher-0.7.2/.gitignore +22 -0
  6. websearcher-0.7.2/.pre-commit-config.yaml +16 -0
  7. websearcher-0.7.2/CHANGELOG.md +199 -0
  8. websearcher-0.7.2/CLAUDE.md +73 -0
  9. {websearcher-0.7.0 → websearcher-0.7.2}/PKG-INFO +24 -142
  10. {websearcher-0.7.0 → websearcher-0.7.2}/README.md +20 -138
  11. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/__init__.py +1 -1
  12. websearcher-0.7.2/WebSearcher/classifiers/__init__.py +7 -0
  13. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/classifiers/footer.py +1 -4
  14. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/classifiers/main.py +108 -26
  15. websearcher-0.7.2/WebSearcher/component_parsers/__init__.py +107 -0
  16. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/ads.py +85 -102
  17. websearcher-0.7.2/WebSearcher/component_parsers/available_on.py +50 -0
  18. websearcher-0.7.2/WebSearcher/component_parsers/banner.py +39 -0
  19. websearcher-0.7.2/WebSearcher/component_parsers/discussions_and_forums.py +59 -0
  20. websearcher-0.7.2/WebSearcher/component_parsers/footer.py +36 -0
  21. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/general.py +236 -246
  22. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/general_questions.py +17 -22
  23. websearcher-0.7.2/WebSearcher/component_parsers/images.py +105 -0
  24. websearcher-0.7.2/WebSearcher/component_parsers/knowledge.py +164 -0
  25. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/knowledge_rhs.py +46 -38
  26. websearcher-0.7.2/WebSearcher/component_parsers/latest_from.py +12 -0
  27. websearcher-0.7.2/WebSearcher/component_parsers/local_news.py +12 -0
  28. websearcher-0.7.2/WebSearcher/component_parsers/local_results.py +128 -0
  29. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/locations.py +16 -13
  30. websearcher-0.7.2/WebSearcher/component_parsers/map_results.py +21 -0
  31. websearcher-0.7.2/WebSearcher/component_parsers/news_quotes.py +55 -0
  32. websearcher-0.7.2/WebSearcher/component_parsers/notices.py +142 -0
  33. websearcher-0.7.2/WebSearcher/component_parsers/people_also_ask.py +34 -0
  34. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/perspectives.py +22 -24
  35. websearcher-0.7.2/WebSearcher/component_parsers/recent_posts.py +12 -0
  36. websearcher-0.7.2/WebSearcher/component_parsers/scholarly_articles.py +23 -0
  37. websearcher-0.7.2/WebSearcher/component_parsers/searches_related.py +60 -0
  38. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/shopping_ads.py +78 -84
  39. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/short_videos.py +11 -11
  40. websearcher-0.7.2/WebSearcher/component_parsers/top_image_carousel.py +40 -0
  41. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/top_stories.py +101 -98
  42. websearcher-0.7.2/WebSearcher/component_parsers/twitter_cards.py +58 -0
  43. websearcher-0.7.2/WebSearcher/component_parsers/twitter_result.py +37 -0
  44. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/component_parsers/videos.py +28 -15
  45. websearcher-0.7.2/WebSearcher/component_parsers/view_more_news.py +50 -0
  46. websearcher-0.7.2/WebSearcher/component_types.py +442 -0
  47. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/components.py +2 -4
  48. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_footer.py +2 -0
  49. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_main.py +86 -28
  50. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/locations.py +36 -47
  51. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/configs.py +4 -2
  52. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/data.py +3 -0
  53. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/searches.py +4 -1
  54. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/requests_searcher.py +6 -5
  55. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/selenium_searcher.py +29 -20
  56. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/searchers.py +1 -0
  57. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/utils.py +56 -22
  58. websearcher-0.7.2/docs/README.md +8 -0
  59. websearcher-0.7.2/docs/guides/changelog.md +68 -0
  60. websearcher-0.7.2/docs/plans/000-component-parsers-update.md +514 -0
  61. websearcher-0.7.2/docs/plans/001-component-parser-details-field.md +404 -0
  62. websearcher-0.7.2/docs/plans/002-class-consolidation.md +324 -0
  63. websearcher-0.7.2/docs/plans/003-ad-parser-structure.md +50 -0
  64. websearcher-0.7.2/docs/plans/004-ads-vs-other-parsers.md +130 -0
  65. websearcher-0.7.2/docs/plans/005-parser-updates-v0.6.7a2.md +176 -0
  66. websearcher-0.7.2/docs/plans/006-formalize-get-title.md +352 -0
  67. websearcher-0.7.2/docs/plans/007-formalize-get-title-prompt.md +80 -0
  68. websearcher-0.7.2/docs/plans/008-ci-test-data.md +80 -0
  69. websearcher-0.7.2/docs/plans/009-refactor-feature-extractor.md +75 -0
  70. websearcher-0.7.2/docs/plans/010-things-to-know-classifier.md +40 -0
  71. websearcher-0.7.2/docs/plans/011-structured-data-in-html.md +80 -0
  72. websearcher-0.7.2/docs/plans/012-parsing-diagnostics.md +65 -0
  73. websearcher-0.7.2/docs/plans/013-dom-position-reorder.md +153 -0
  74. websearcher-0.7.2/docs/plans/014-bump-0.6.9.md +90 -0
  75. websearcher-0.7.2/docs/plans/015-js-driven-urls.md +33 -0
  76. websearcher-0.7.2/docs/plans/016-standardize-data-models.md +237 -0
  77. websearcher-0.7.2/docs/plans/017-parse-pipeline-optimization.md +490 -0
  78. websearcher-0.7.2/docs/plans/018-visible-flag-on-results.md +230 -0
  79. websearcher-0.7.2/docs/plans/019-video-details-from-evlb-cards.md +123 -0
  80. {websearcher-0.7.0 → websearcher-0.7.2}/pyproject.toml +12 -8
  81. websearcher-0.7.2/scripts/show_parsed.py +116 -0
  82. websearcher-0.7.2/scripts/show_serp.py +127 -0
  83. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[01f85d1329ba].json +179 -64
  84. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[032572e185d3].json +37 -17
  85. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[0d3fc3b49b76].json +15 -3
  86. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[0ed311025efc].json +47 -7
  87. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[130eba186e94].json +33 -5
  88. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[18eccfe8454e].json +29 -9
  89. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[2c0aa0bbcd0c].json +51 -37
  90. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[2d1b05a046b2].json +15 -3
  91. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[305b53af69be].json +125 -80
  92. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[30c5d6bdb650].json +17 -5
  93. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[39617f527744].json +18 -6
  94. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3c03a4a2cb7c].json +9 -3
  95. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3c09a0f0c92f].json +3 -3
  96. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[3f5efb1dc358].json +19 -11
  97. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[45b6e019bfa2].json +37 -17
  98. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[4c8d8d2f226c].json +11 -3
  99. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[53940e35cc92].json +52 -13
  100. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[56cbcf8cd4dc].json +50 -11
  101. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[56f2eab63e9d].json +17 -5
  102. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[5898b04fb534].json +13 -5
  103. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6978d0cd767d].json +34 -18
  104. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6aa70651b0cd].json +36 -12
  105. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6e206db14899].json +13 -5
  106. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[6e401e618433].json +64 -32
  107. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7049404a2dd6].json +39 -23
  108. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7333536d2911].json +56 -16
  109. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7ad9715f3597].json +35 -7
  110. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7b89c00120e3].json +61 -10
  111. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[7d76d3a83ebc].json +121 -76
  112. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[811a27f92284].json +415 -156
  113. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[83b17a6a7750].json +15 -3
  114. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8d1b75b71e7f].json +36 -12
  115. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8e820f7b024f].json +75 -23
  116. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[8f98fa9c0bef].json +15 -3
  117. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9101d12ab778].json +38 -18
  118. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[923a428c1c22].json +49 -9
  119. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[97404b7b7c61].json +38 -18
  120. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[984065877aad].json +35 -7
  121. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9a7e39d95bf0].json +19 -11
  122. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[9ed1baa7715d].json +122 -56
  123. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[a6c881e003e2].json +150 -26
  124. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[a6c8fe7fe769].json +18 -6
  125. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[aa594f199c3d].json +36 -12
  126. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[b15c5131b06c].json +12 -4
  127. websearcher-0.7.2/tests/__snapshots__/test_parse_serp/test_parse_serp[b186024ec98a].json +413 -0
  128. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[b2e1777bf0f2].json +199 -153
  129. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[be99c971b8f7].json +37 -17
  130. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[c48f8aa3f6da].json +13 -5
  131. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[c9ab650f5bda].json +37 -17
  132. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[cad43c3268a8].json +105 -41
  133. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[ce37f114963e].json +35 -7
  134. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d1855fa9cd1c].json +51 -7
  135. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d1ac0c4abb10].json +21 -9
  136. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[d920789249af].json +20 -8
  137. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[da9b4fce9ab0].json +33 -5
  138. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[dc5861b33dda].json +15 -3
  139. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[e71a1cb4cd70].json +122 -46
  140. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[e828d00dc1b3].json +13 -5
  141. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[eab14aa4ff5d].json +36 -8
  142. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[f006c9318116].json +3 -3
  143. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[f6fae1c9a96e].json +52 -20
  144. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[faa9c7c889db].json +57 -25
  145. websearcher-0.7.2/tests/fixtures/serps-v0.7.2-knowledge-subcards.json.bz2 +0 -0
  146. websearcher-0.7.2/tests/test_component_types.py +96 -0
  147. {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_utils.py +4 -6
  148. {websearcher-0.7.0 → websearcher-0.7.2}/uv.lock +189 -619
  149. websearcher-0.7.0/.github/workflows/publish.yml +0 -37
  150. websearcher-0.7.0/.github/workflows/test.yml +0 -32
  151. websearcher-0.7.0/.gitignore +0 -17
  152. websearcher-0.7.0/.pre-commit-config.yaml +0 -7
  153. websearcher-0.7.0/WebSearcher/classifiers/__init__.py +0 -11
  154. websearcher-0.7.0/WebSearcher/classifiers/header_components.py +0 -16
  155. websearcher-0.7.0/WebSearcher/classifiers/header_text.py +0 -163
  156. websearcher-0.7.0/WebSearcher/component_parsers/__init__.py +0 -104
  157. websearcher-0.7.0/WebSearcher/component_parsers/available_on.py +0 -37
  158. websearcher-0.7.0/WebSearcher/component_parsers/banner.py +0 -40
  159. websearcher-0.7.0/WebSearcher/component_parsers/discussions_and_forums.py +0 -60
  160. websearcher-0.7.0/WebSearcher/component_parsers/footer.py +0 -33
  161. websearcher-0.7.0/WebSearcher/component_parsers/images.py +0 -133
  162. websearcher-0.7.0/WebSearcher/component_parsers/knowledge.py +0 -149
  163. websearcher-0.7.0/WebSearcher/component_parsers/latest_from.py +0 -15
  164. websearcher-0.7.0/WebSearcher/component_parsers/local_news.py +0 -15
  165. websearcher-0.7.0/WebSearcher/component_parsers/local_results.py +0 -103
  166. websearcher-0.7.0/WebSearcher/component_parsers/map_results.py +0 -26
  167. websearcher-0.7.0/WebSearcher/component_parsers/news_quotes.py +0 -53
  168. websearcher-0.7.0/WebSearcher/component_parsers/notices.py +0 -173
  169. websearcher-0.7.0/WebSearcher/component_parsers/people_also_ask.py +0 -43
  170. websearcher-0.7.0/WebSearcher/component_parsers/recent_posts.py +0 -15
  171. websearcher-0.7.0/WebSearcher/component_parsers/scholarly_articles.py +0 -31
  172. websearcher-0.7.0/WebSearcher/component_parsers/searches_related.py +0 -53
  173. websearcher-0.7.0/WebSearcher/component_parsers/top_image_carousel.py +0 -38
  174. websearcher-0.7.0/WebSearcher/component_parsers/twitter_cards.py +0 -62
  175. websearcher-0.7.0/WebSearcher/component_parsers/twitter_result.py +0 -38
  176. websearcher-0.7.0/WebSearcher/component_parsers/view_more_news.py +0 -52
  177. websearcher-0.7.0/WebSearcher/models/cmpt_mappings.py +0 -193
  178. {websearcher-0.7.0 → websearcher-0.7.2}/.python-version +0 -0
  179. {websearcher-0.7.0 → websearcher-0.7.2}/LICENSE +0 -0
  180. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/__init__.py +0 -0
  181. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_header.py +0 -0
  182. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/extractors/extractor_rhs.py +0 -0
  183. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/feature_extractor.py +0 -0
  184. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/logger.py +0 -0
  185. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/__init__.py +0 -0
  186. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/models/features.py +0 -0
  187. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/parsers.py +0 -0
  188. {websearcher-0.7.0 → websearcher-0.7.2}/WebSearcher/search_methods/__init__.py +0 -0
  189. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/condense_fixtures.py +0 -0
  190. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_locations.py +0 -0
  191. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_parse.py +0 -0
  192. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_screenshot.py +0 -0
  193. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_search.py +0 -0
  194. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_search_headers.py +0 -0
  195. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/demo_searches.py +0 -0
  196. {websearcher-0.7.0 → websearcher-0.7.2}/scripts/parsed_to_csv.py +0 -0
  197. {websearcher-0.7.0 → websearcher-0.7.2}/tests/__snapshots__/test_parse_serp/test_parse_serp[82e35954f552].json +0 -0
  198. {websearcher-0.7.0 → websearcher-0.7.2}/tests/fixtures/serps-v0.6.7.json.bz2 +0 -0
  199. {websearcher-0.7.0 → websearcher-0.7.2}/tests/fixtures/serps-v0.6.8.json.bz2 +0 -0
  200. {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_feature_extractor.py +0 -0
  201. {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_locations.py +0 -0
  202. {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_models.py +0 -0
  203. {websearcher-0.7.0 → websearcher-0.7.2}/tests/test_parse_serp.py +0 -0
@@ -0,0 +1,10 @@
+ version: 2
+ updates:
+   - package-ecosystem: github-actions
+     directory: /
+     schedule:
+       interval: weekly
+   - package-ecosystem: uv
+     directory: /
+     schedule:
+       interval: weekly
@@ -0,0 +1,5 @@
+ ## Summary
+ <!-- What does this PR do and why? -->
+
+ ## Test plan
+ <!-- How was this tested? -->
@@ -0,0 +1,33 @@
+ name: Publish
+
+ on:
+   push:
+     tags: ["v*"]
+
+ jobs:
+   build:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v6
+       - uses: astral-sh/setup-uv@v8.1.0
+       - run: uv build
+       - uses: actions/upload-artifact@v7
+         with:
+           name: dist
+           path: dist/
+
+   publish:
+     needs: build
+     runs-on: ubuntu-latest
+     environment:
+       name: pypi
+       url: https://pypi.org/project/WebSearcher/
+     permissions:
+       id-token: write
+       contents: read
+     steps:
+       - uses: actions/download-artifact@v8
+         with:
+           name: dist
+           path: dist/
+       - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,26 @@
+ name: Tests
+
+ on:
+   push:
+     branches: [dev, master]
+   pull_request:
+     branches: [dev, master]
+
+ permissions:
+   contents: read
+
+ jobs:
+   test:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.12", "3.13", "3.14"]
+     steps:
+       - uses: actions/checkout@v6
+       - uses: astral-sh/setup-uv@v8.1.0
+       - run: uv python install ${{ matrix.python-version }}
+       - run: uv sync --all-groups --python ${{ matrix.python-version }}
+       - run: uv run ruff check .
+       - run: uv run ruff format --check .
+       - run: uv run pyrefly check
+       - run: uv run pytest --cov --cov-report=term-missing
@@ -0,0 +1,22 @@
+ .venv/
+ .archive/
+ .claude
+ TODO.md
+
+ build/
+ data/
+ notebooks/
+ scripts/ads-no-subtype/
+
+ *.egg-info
+ *__pycache__
+
+ # Caches
+ .pytest_cache/
+ .ruff_cache/
+ .coverage
+
+ # Environment files
+ .env
+ .env.*
+ !.env.example
@@ -0,0 +1,16 @@
+ repos:
+   - repo: https://github.com/astral-sh/ruff-pre-commit
+     rev: v0.15.5
+     hooks:
+       - id: ruff-format
+       - id: ruff
+         args: [--fix]
+   - repo: local
+     hooks:
+       - id: pyrefly-check
+         name: pyrefly check
+         entry: uv run pyrefly check
+         language: system
+         types_or: [python, pyi]
+         pass_filenames: false
+         require_serial: true
@@ -0,0 +1,199 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/),
+ and this project adheres to [Semantic Versioning](https://semver.org/).
+
+ ## [Unreleased]
+
+ ## [0.7.1] - 2026-05-03
+
+ - Added component type registry: consolidates header-text mappings, parser dispatch, and labels (supersedes `cmpt_mappings.py`)
+ - Added `find_by_selectors` utility; applied to classifiers
+ - Added `Selector` NamedTuple in `utils` for tag/attrs selector lists
+ - Added pyrefly type checking; fixed all type errors across the codebase
+ - Added structural tests for component type registry; resolved "Ver más" collision
+ - Refactored ad parser and classifier; tidied carousel scope; edited local parser/classifier selectors
+ - Restored `knowledge_panel` cmpt-attr check on `jscontroller qTdDb`
+ - Moved `ClassifyHeaderText` into `main.py` as `ClassifyMainHeader`; removed dead `ClassifyHeaderComponent`
+ - Updated CI: ruff lint + format checks, pyrefly check, coverage; matrix narrowed to 3.12-3.14
+ - Switched publish workflow to tag-based with split build + publish jobs
+ - Added Dependabot config, PR template, pre-commit ruff + pyrefly hooks
+ - Bumped support floor: `requires-python = ">=3.12"`, ruff/pyrefly `target-version = py312`
+ - Bumped security floors: `requests>=2.33.0`, `lxml>=6.1.0`, `pytest>=9.0.3`; pulled Dependabot patches for `pygments`
+ - Fixed `ResponseOutput` not subscriptable for dict-style access
+ - Added `CHANGELOG.md` and changelog guide (migrated from README)
+ - Added `docs/plans/` development history with landing page
+
+ ## [0.7.0] - 2026-03-15
+
+ - **Breaking:** `details` field is now always `dict | None` with a self-describing `type` key (e.g. `{"type": "menu", "items": [...]}`)
+ - **Breaking:** `parse_serp()` now always returns a dict with `results` and `features` keys; the `extract_features` parameter has been removed
+ - Standardized all models on Pydantic BaseModel (removed dataclasses)
+ - Added `ResponseOutput` and `ParsedSERP` typed models
+ - Removed `DetailsItem`, `DetailsList` classes
+ - Normalized `local_results` sub_type for location-specific headers
+ - Replaced `os` with `pathlib.Path` throughout
+ - Consolidated `webutils.py` into `utils.py`
+ - Added ruff formatting, linting, and pre-commit hooks
+ - Added test coverage reporting (69%)
+ - Added unit tests for utils, locations, models, and feature extractor
+ - Replaced pandas with polars in demo scripts
+
+ ## [0.6.9] - 2026-02-22
+
+ - Fixed bugs in component parsers (class comparison, assignment operator, set literal)
+ - Fixed `return` in `finally` block in requests searcher
+ - Added captcha detection to feature extractor
+ - Added captcha handling and jittered delay to demo searches
+ - Dropped pandas from core dependencies
+ - Cleaned up legacy typing imports
+ - Removed poetry.toml
+
+ ## [0.6.8] - 2026-02-20
+
+ - Migrated from Poetry to uv for dependency management
+ - Added Python 3.12-3.14 test matrix in GitHub Actions
+ - Added `flights` classifier and `standard-4` layout
+ - Added local service ad parser
+ - Extracted bottom ads before main column
+ - Fixed `return` in `finally` block warning in selenium searcher
+
+ ## [0.6.7] - 2026-02-06
+
+ - Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
+ - Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
+ - Added `sub_type` to perspectives parser from header text
+ - Added CI test workflow on push to dev branch
+ - Added compressed test fixtures with `condense_fixtures.py` script
+ - Updated dependency lower bounds for security patches (protobuf, orjson)
+ - Updated GitHub Actions to checkout v6 and setup-python v6
+
+ ## [0.6.6] - 2025-12-05
+
+ - Update packages with dependabot alerts (brotli, urllib3)
+
+ ## [0.6.5] - 2025-12-05
+
+ - Add GitHub Actions section to README
+
+ ## [0.6.0] - 2025-03-28
+
+ - Method for collecting data with selenium; requests no longer works without a redirect
+ - Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)
+
+ ## [0.5.2] - 2025-03-09
+
+ - Added support for Spanish component headers by text
+ - Pull request [#74](https://github.com/gitronald/WebSearcher/pull/74)
+
+ ## [0.5.1] - 2025-03-07
+
+ - Fixed canonical name -> UULE converter using `protobuf`, see [this gist](https://gist.github.com/gitronald/66cac42194ea2d489ff3a1e32651e736) for details
+ - Added lang arg to specify language in se.search, uses hl URL param and does not change Accept-Language request header (which defaults to en-US), but works in tests.
+ - Fixed null location/language arg input handling (again)
+ - Pull Request [#76](https://github.com/gitronald/WebSearcher/pull/76)
+
+ ## [0.5.0] - 2025-02-03
+
+ - configuration now using poetry v2
+
+ ## [0.4.9] - 2025-02-03
+
+ - last version with poetry v1, future versions (`>=0.5.0`) will use [poetry v2](https://python-poetry.org/blog/announcing-poetry-2.0.1/) configs.
+
+ ## [0.4.2] - [0.4.8] - 2024-11-11 to 2025-02-03
+
+ - varied parser updates, testing with py3.12.
+
+ ## [0.4.1] - 2024-08-26
+
+ - Added notices component types, including query edits, suggestions, language tips, and location tips.
+
+ ## [0.4.0] - 2024-05-27
+
+ - Restructured parser for component classes, split classifier into submodules for header, main, footer, etc., and rewrote extractors to work with component classes. Various bug fixes.
+
+ ## [0.3.13]
+
+ - New footer parser, broader extraction coverage, various bug and deprecation fixes.
+
+ ## [0.3.12] - 2024-05-09
+
+ - Added num_results to search args, added handling for local results text and labels (made by the SE), ignore hidden_survey type at extraction.
+
+ ## [0.3.11] - 2024-05-08
+
+ - Added extraction of labels for ads (made by the SE), use model validation, cleanup and various bug fixes.
+
+ ## [0.3.10] - 2024-05-06
+
+ - Updated component classifier for images, added exportable header text mappings, added gist on localized searches.
+
+ ## [0.3.9] - 2024-02-25
+
+ - Small fixes for video url parsing
+
+ ## [0.3.8] - 2024-02-13
+
+ - Using SERP pydantic model, added github pip publishing workflow
+
+ ## [0.3.7] - 2024-02-09
+
+ - Fixed localization, parser and classifier updates and fixes, image subtypes, changed rhs component handling.
+
+ ## [0.3.0] - [0.3.6] - 2023-10-16 to 2023-12-08
+
+ - Parser updates for SERPs from 2022 and 2023, standalone extractors file, added pydantic, reduced redundancies in outputs.
+
+ ## [2020.0.0], [2022.12.18], [2023.01.04] - 2022-12-19, 2022-12-21, and 2023-01-04
+
+ - Various updates, attempt at date versioning that seemed like a good idea at the time ¯\\\_(ツ)\_/¯
+
+ <!-- refs/tags/v2022.12.18 -->
+ <!-- refs/tags/v2023.01.04 -->
+
+ ## [0.2.15]
+
+ - Fix people-also-ask and hotel false positives, add flag for left-hand side bar
+
+ ## [0.2.14]
+
+ - Add shopping ads carousel and three knowledge subtypes (flights, hotels, events)
+
+ ## [0.2.13]
+
+ - Small fixes for knowledge subtypes, general subtypes, and ads
+
+ ## [0.2.12] - 2021-12-17
+
+ - Try to brotli decompress by default
+
+ ## [0.2.11] - 2021-11-15
+
+ - Fixed local result parser and no return in general extra details
+
+ ## [0.2.10]
+
+ - a) Add right-hand-side knowledge panel and top image carousel, b) Add knowledge and general component subtypes, c) Updates to component classifier, footer, ad, and people_also_ask components
+
+ ## [0.2.9] - 2021-05-08
+
+ - Various fixes for SERPs with a left-hand side bar, which are becoming more common and change other parts of the SERP layout.
+
+ ## [0.2.8] - 2021-03-08
+
+ - Small fixes due to HTML changes, such as missing titles and URLs in general components
+
+ ## [0.2.7] - 2020-11-30
+
+ - Added fix for parsing twitter cards, removed pandas dependencies and several unused functions, moving towards greater package simplicity.
+
+ ## [0.2.6] - 2020-11-14
+
+ - Updated ad parser for latest format, still handles older ad format.
+
+ ## [0.2.5] - 2020-07-24
+
+ - Google Search, like most online platforms, undergoes changes over time. These changes often affect not just their outward appearance, but the underlying code that parsers depend on. This makes parsing a goal with a moving target. Sometime around February 2020, Google changed a few elements of their HTML structure which broke this parser. I created this patch for these changes, but have not tested its backwards compatibility (e.g. on SERPs collected prior to 2/2020). More generally, there's no guarantee on future compatibility. In fact, there is almost certainly the opposite: more changes will inevitably occur. If you have older data that you need to parse and the current parser doesn't work, you can try using `0.2.1`, or send a pull request if you find a way to make both work!
@@ -0,0 +1,73 @@
+ # WebSearcher
+
+ Parser library for Google search result pages. Each SERP is decomposed into a typed list of `results` (e.g. `general`, `top_stories`, `perspectives`, `local_results`, `available_on`, `videos`, `searches_related`, ...) by an `Extractor` + `ClassifyMain`/`ClassifyFooter` pipeline, then per-component parsers populate `title`, `url`, `text`, `cite`, `details`, etc.
+
+ ## Running code
+
+ All Python execution goes through `uv` (the project's `.python-version` is 3.14):
+
+ ```bash
+ uv run python scripts/...
+ uv run pytest
+ ```
+
+ Do not use `poetry`, `python`, or `python3` directly. Some skill files still reference `poetry` from before the migration — treat those as stale.
+
+ ## Inspection scripts
+
+ Use these when diagnosing parser issues or auditing a demo dataset:
+
+ | Script | When |
+ |---|---|
+ | `scripts/show_parsed.py "{query}" --data-dir {dir}` | See the parsed-results table for a SERP (rank, type, sub_type, title, url, details summary). Re-parses fresh on every call — always reflects current code. |
+ | `scripts/show_serp.py "{query}" --data-dir {dir}` | Serve the saved HTML on localhost:8765 with Google's overlays/scroll-locks stripped. `--raw` keeps the original. |
+ | `scripts/demo_screenshot.py --data-dir {dir}` | Render a screenshot of the SERP with component-type colored borders. |
+ | `scripts/show_serp.py --list --data-dir {dir}` | List the queries in a dataset. |
+
+ Default `--data-dir` resolves to `data/demo-ws-v{ws.__version__}/`. Pass an explicit path for older fixtures or `/tmp` extracts.
+
+ ## Demo data layout
+
+ ```
+ data/demo-ws-v{version}/
+   serps.json # JSONL: {qry, html, serp_id, ...}
+   parsed.json # JSONL: {results: [...], features: {...}, ...} — pre-parsed output
+   searches.json # JSONL: {qry, serp_id, timestamp, ...} — search metadata
+ tests/fixtures/
+   serps-v0.6.7.json.bz2 # older bz2-compressed JSONL fixtures
+   serps-v0.6.8.json.bz2 # — useful for regressions; the "northern lights" SERP used in plan 018/019 lives here
+ ```
+
+ ## Skills (`.claude/skills/`)
+
+ | Skill | Use case |
+ |---|---|
+ | `/parser-update` | 7-phase parser diagnose-and-fix workflow on demo data. Now references the inspection scripts above. |
+ | `/compare-parsed` | Compare parsed output before/after uncommitted changes — regression check. |
+ | `/compare-selectors` | Diff selectors between current and committed versions of a parser file. |
+ | `/reparse` | Re-parse fixtures with current code and diff against pre-parsed output. |
+
+ ## Plans and TODO
+
+ - `docs/plans/{NNN}-{slug}.md` — design specs with `status: draft|active|done|abandoned` frontmatter
+ - `TODO.md` — gitignored, flat list of open/closed items, each pointing to a plan or marked `(no plan)`
+
+ Conventions per global rules in `~/.claude/rules/plan-files.md` and `~/.claude/rules/todo-files.md`.
+
+ ## Schema conventions
+
+ A `details` block on a parsed result uses a controlled vocabulary of `type` values. Try to reuse existing labels rather than invent new ones:
60
+
61
+ - `text` — `{items: [str, ...]}` (people_also_ask, searches_related)
62
+ - `hyperlinks` — `{items: [{url, text, ...}]}` (available_on, top_image_carousel, knowledge_rhs, footer image cards, general submenu)
63
+ - `ratings` — rating data, see general's modern rating widget and shopping_ads
64
+ - `place` — local_results: rating, n_reviews, price, category, address, phone, hours, review_snippet, website, directions
65
+ - `panel`, `video` — used by knowledge / general video subtypes
66
+
67
+ When a `details` block would have only null fields, set `details=None` instead of emitting a hollow dict.
68
+
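The `details=None` convention above can be sketched as a small normalization step. This is a minimal illustration, not the package's code: `normalize_details` is a hypothetical helper name, and it assumes a `details` block is a plain dict whose `type` key is a label rather than payload. Only `None` values count as null here; empty lists are left alone because the convention does not say how to treat them.

```python
from __future__ import annotations

from typing import Any


def normalize_details(details: dict[str, Any] | None) -> dict[str, Any] | None:
    """Collapse a details block to None when every payload field is null.

    Hypothetical helper for illustration; not part of WebSearcher's API.
    """
    if details is None:
        return None
    # The `type` key only describes the block, so it is not counted as data.
    payload = {key: value for key, value in details.items() if key != "type"}
    # A block with no payload keys at all is also treated as hollow.
    if all(value is None for value in payload.values()):
        return None
    return details


hollow = {"type": "place", "rating": None, "phone": None}
filled = {"type": "text", "items": ["how tall is everest"]}

print(normalize_details(hollow))  # None
print(normalize_details(filled))  # {'type': 'text', 'items': ['how tall is everest']}
```

Running the check once, just before a result is emitted, would keep individual parsers from each having to guard against hollow dicts.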
69
+ ## Notable parser dependencies
70
+
71
+ - `top_stories.parse_top_stories` is shared by `top_stories`, `perspectives`, `local_news`, `recent_posts`, and `latest_from`. One selector change ripples to all five. Verify with `/compare-parsed` after edits.
72
+ - `parse_subtype_details` in `general.py` has its own elif chain for layout variants (subresult, submenu, submenu_rating, submenu_mini, submenu_scholarly, submenu_product). Some branches are dormant on modern SERPs; modern rating widget reads from `Y0A0hc`/`z3HNkc` aria-labels.
73
+ - `classifiers/main.py` runs an ordered chain of classifiers; order matters when components match multiple. `available_on` was recently moved ahead of `knowledge_panel` because the streaming-providers widget was being absorbed.
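The ordering note above can be illustrated with a toy classifier chain. This is a sketch under stated assumptions, not the contents of `classifiers/main.py`: it assumes the first classifier to return a label other than `"unknown"` wins, and the classifier functions, the dict-based component, and its keys are all hypothetical stand-ins.

```python
from typing import Callable

Component = dict  # hypothetical stand-in for a parsed SERP component
Classifier = Callable[[Component], str]


def classify_available_on(cmpt: Component) -> str:
    # Hypothetical check for the streaming-providers widget.
    return "available_on" if cmpt.get("has_streaming_providers") else "unknown"


def classify_knowledge_panel(cmpt: Component) -> str:
    # Hypothetical check that is broad enough to match the same widget.
    return "knowledge_panel" if cmpt.get("is_panel") else "unknown"


def classify(cmpt: Component, classifiers: list[Classifier]) -> str:
    """Return the first non-"unknown" label from an ordered chain."""
    for classifier in classifiers:
        label = classifier(cmpt)
        if label != "unknown":
            return label
    return "unknown"


# A widget that satisfies both checks: the result depends only on order.
widget = {"is_panel": True, "has_streaming_providers": True}

print(classify(widget, [classify_available_on, classify_knowledge_panel]))
# available_on
print(classify(widget, [classify_knowledge_panel, classify_available_on]))
# knowledge_panel
```

With the broader check first, the narrower component type is never reached, which matches the absorption described for `available_on` and `knowledge_panel`.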
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: WebSearcher
3
- Version: 0.7.0
3
+ Version: 0.7.2
4
4
  Summary: Tools for conducting, collecting, and parsing web search
5
5
  Project-URL: homepage, http://github.com/gitronald/WebSearcher
6
6
  Project-URL: repository, http://github.com/gitronald/WebSearcher
@@ -8,14 +8,14 @@ Author-email: "Ronald E. Robertson" <rer@acm.org>
8
8
  License-Expression: GPL-3.0
9
9
  License-File: LICENSE
10
10
  Keywords: parser,search,web
11
- Requires-Python: >=3.10
11
+ Requires-Python: >=3.12
12
12
  Requires-Dist: beautifulsoup4>=4.12.3
13
13
  Requires-Dist: brotli>=1.1.0
14
- Requires-Dist: lxml>=5.3.0
14
+ Requires-Dist: lxml>=6.1.0
15
15
  Requires-Dist: orjson<4.0.0,>=3.11.5
16
16
  Requires-Dist: protobuf<7.0.0,>=6.33.5
17
17
  Requires-Dist: pydantic>=2.9.2
18
- Requires-Dist: requests>=2.32.4
18
+ Requires-Dist: requests>=2.33.0
19
19
  Requires-Dist: selenium>=4.9.0
20
20
  Requires-Dist: tldextract>=5.1.2
21
21
  Requires-Dist: undetected-chromedriver>=3.5.5
@@ -29,41 +29,40 @@ This package provides tools for conducting algorithm audits of web search and
29
29
  includes a scraper built on `selenium` with tools for geolocating, conducting,
30
30
  and saving searches. It also includes a modular parser built on `BeautifulSoup`
31
31
  for decomposing a SERP into a list of components with categorical classifications
32
- and position-based specifications.
32
+ and position-based specifications.
33
+
34
+ ## Recent Changes
35
+
36
+ - `0.7.1`: Added component type registry and pyrefly type checking; refreshed CI/tooling (lint, format, type-check, tag-based publish); bumped Python floor to 3.12
37
+ - `0.7.0`: Breaking changes, standardized data models on Pydantic, typed `details` field, and removed `DetailsItem`/`DetailsList`
38
+
39
+ See [CHANGELOG.md](CHANGELOG.md) for a longer history of changes by version.
33
40
 
34
41
  ## Table of Contents
35
42
 
36
43
  - [WebSearcher](#websearcher)
44
+ - [Tools for conducting and parsing web searches](#tools-for-conducting-and-parsing-web-searches)
45
+ - [Recent Changes](#recent-changes)
46
+ - [Table of Contents](#table-of-contents)
37
47
  - [Getting Started](#getting-started)
38
48
  - [Usage](#usage)
39
49
  - [Example Search Script](#example-search-script)
40
50
  - [Step by Step](#step-by-step)
51
+ - [1. Initialize Collector](#1-initialize-collector)
52
+ - [2. Conduct a Search](#2-conduct-a-search)
53
+ - [3. Parse Search Results](#3-parse-search-results)
54
+ - [4. Save HTML and Metadata](#4-save-html-and-metadata)
55
+ - [5. Save Parsed Results](#5-save-parsed-results)
41
56
  - [Localization](#localization)
42
57
  - [Contributing](#contributing)
58
+ - [Repair or Enhance a Parser](#repair-or-enhance-a-parser)
59
+ - [Add a Parser](#add-a-parser)
60
+ - [Testing](#testing)
61
+ - [Test Fixtures](#test-fixtures)
43
62
  - [GitHub Actions](#github-actions)
44
- - [Recent Updates](#recent-updates)
45
- - [Update Log](#update-log)
46
63
  - [Similar Packages](#similar-packages)
47
64
  - [License](#license)
48
65
 
49
- ---
50
- ## Recent Updates
51
-
52
- ### 0.7.0 (dev)
53
-
54
- - **Breaking:** `details` field is now always `dict | None` with a self-describing `type` key (e.g. `{"type": "menu", "items": [...]}`)
55
- - **Breaking:** `parse_serp()` now always returns a dict with `results` and `features` keys; the `extract_features` parameter has been removed
56
- - Standardized all models on Pydantic BaseModel (removed dataclasses)
57
- - Added `ResponseOutput` and `ParsedSERP` typed models
58
- - Removed `DetailsItem`, `DetailsList` classes
59
- - Normalized `local_results` sub_type for location-specific headers
60
- - Replaced `os` with `pathlib.Path` throughout
61
- - Consolidated `webutils.py` into `utils.py`
62
- - Added ruff formatting, linting, and pre-commit hooks
63
- - Added test coverage reporting (69%)
64
- - Added unit tests for utils, locations, models, and feature extractor
65
- - Replaced pandas with polars in demo scripts
66
-
67
66
  ---
68
67
  ## Getting Started
69
68
 
@@ -295,123 +294,6 @@ To release a new version:
295
294
  2. Once merged, the package is automatically published to PyPI
296
295
 
297
296
  ---
298
- ## Update Log
299
-
300
- `0.7.0`
301
- - Standardize data models on Pydantic, typed details field, remove DetailsItem/DetailsList
302
-
303
- `0.6.9`
304
- - Fixed bugs in component parsers (class comparison, assignment operator, set literal)
305
- - Fixed `return` in `finally` block in requests searcher
306
- - Added captcha detection to feature extractor
307
- - Added captcha handling and jittered delay to demo searches
308
- - Dropped pandas from core dependencies
309
- - Cleaned up legacy typing imports
310
- - Removed poetry.toml
311
-
312
- `0.6.8`
313
- - Migrated from Poetry to uv for dependency management
314
- - Added Python 3.12-3.14 test matrix in GitHub Actions
315
- - Added `flights` classifier and `standard-4` layout
316
- - Added local service ad parser
317
- - Extracted bottom ads before main column
318
- - Fixed `return` in `finally` block warning in selenium searcher
319
-
320
- `0.6.7`
321
- - Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
322
- - Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
323
- - Added `sub_type` to perspectives parser from header text
324
- - Added CI test workflow on push to dev branch
325
- - Added compressed test fixtures with `condense_fixtures.py` script
326
- - Updated dependency lower bounds for security patches (protobuf, orjson)
327
- - Updated GitHub Actions to checkout v6 and setup-python v6
328
-
329
- `0.6.6`
330
- - Update packages with dependabot alerts (brotli, urllib3)
331
-
332
- `0.6.5`
333
- - Add GitHub Actions section to README
334
-
335
- `0.6.0`
336
- - Method for collecting data with selenium; requests no longer works without a redirect
337
- - Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)
338
-
339
- `0.5.2`
340
- - Added support for Spanish component headers by text
341
- - Pull request [#74](https://github.com/gitronald/WebSearcher/pull/74)
342
-
343
- `0.5.1`
344
- - Fixed canonical name -> UULE converter using `protobuf`, see [this gist](https://gist.github.com/gitronald/66cac42194ea2d489ff3a1e32651e736) for details
345
- - Added lang arg to specify language in se.search, uses hl URL param and does not change Accept-Language request header (which defaults to en-US), but works in tests.
346
- - Fixed null location/language arg input handling (again)
347
- - Pull Request [#76](https://github.com/gitronald/WebSearcher/pull/76)
348
-
349
- `0.5.0`
350
- - configuration now using poetry v2
351
-
352
- `0.4.9` - last version with poetry v1, future versions (`>=0.5.0`) will use [poetry v2](https://python-poetry.org/blog/announcing-poetry-2.0.1/) configs.
353
-
354
- `0.4.2` - `0.4.8` - varied parser updates, testing with py3.12.
355
-
356
- `0.4.1` - Added notices component types, including query edits, suggestions, language tips, and location tips.
357
-
358
- `0.4.0` - Restructured parser for component classes, split classifier into submodules for header, main, footer, etc., and rewrote extractors to work with component classes. Various bug fixes.
359
-
360
- `0.3.13` - New footer parser, broader extraction coverage, various bug and deprecation fixes.
361
-
362
- `0.3.12` - Added num_results to search args, added handling for local results text and labels (made by the SE), ignore hidden_survey type at extraction.
363
-
364
- `0.3.11` - Added extraction of labels for ads (made by the SE), use model validation, cleanup and various bug fixes.
365
-
366
- `0.3.10` - Updated component classifier for images, added exportable header text mappings, added gist on localized searches.
367
-
368
- `0.3.9` - Small fixes for video url parsing
369
-
370
- `0.3.8` - Using SERP pydantic model, added github pip publishing workflow
371
-
372
- `0.3.7` - Fixed localization, parser and classifier updates and fixes, image subtypes, changed rhs component handling.
373
-
374
- `0.3.0` - `0.3.6` - Parser updates for SERPs from 2022 and 2023, standalone extractors file, added pydantic, reduced redundancies in outputs.
375
-
376
- `2020.0.0`, `2022.12.18`, `2023.01.04` - Various updates, attempt at date versioning that seemed like a good idea at the time ¯\\\_(ツ)\_/¯
377
-
378
- <!-- refs/tags/v2022.12.18 -->
379
- <!-- refs/tags/v2023.01.04 -->
380
-
381
- `0.2.15` - Fix people-also-ask and hotel false positives, add flag for left-hand side bar
382
-
383
- `0.2.14` - Add shopping ads carousel and three knowledge subtypes (flights, hotels, events)
384
-
385
- `0.2.13` - Small fixes for knowledge subtypes, general subtypes, and ads
386
-
387
- `0.2.12` - Try to brotli decompress by default
388
-
389
- `0.2.11` - Fixed local result parser and no return in general extra details
390
-
391
- `0.2.10` - a) Add right-hand-side knowledge panel and top image carousel, b) Add knowledge and general component subtypes, c) Updates to component classifier, footer, ad, and people_also_ask components
392
-
393
- `0.2.9` - Various fixes for SERPs with a left-hand side bar, which are becoming more common and change other parts of the SERP layout.
394
-
395
- `0.2.8` - Small fixes due to HTML changes, such as missing titles and URLs in general components
396
-
397
- `0.2.7` - Added fix for parsing twitter cards, removed pandas dependencies and
398
- several unused functions, moving towards greater package simplicity.
399
-
400
- `0.2.6` - Updated ad parser for latest format, still handles older ad format.
401
-
402
- `0.2.5` - Google Search, like most online platforms, undergoes changes over time.
403
- These changes often affect not just their outward appearance, but the underlying
404
- code that parsers depend on. This makes parsing a goal with a moving target.
405
- Sometime around February 2020, Google changed a few elements of their HTML
406
- structure which broke this parser. I created this patch for these changes,
407
- but have not tested its backwards compatibility (e.g. on SERPs collected prior to
408
- 2/2020). More generally, there's no guarantee on future compatibility. In fact,
409
- there is almost certainly the opposite: more changes will inevitably occur.
410
- If you have older data that you need to parse and the current parser doesn't work,
411
- you can try using `0.2.1`, or send a pull request if you find a way to make both work!
412
-
413
-
414
- ---
415
297
  ## Similar Packages
416
298
 
417
299
  Many of the packages I've found for collecting web search data via python are no longer maintained, but others are still ongoing and interesting or useful. The primary strength of WebSearcher is its parser, which provides a level of detail that enables examinations of SERP [composition](http://dl.acm.org/citation.cfm?doid=3178876.3186143) by recording the type and position of each result, and its modular design, which has allowed us to (intermittently) maintain it for so long and to cover such a wide array of component types (currently 25 without considering `sub_types`). Feel free to add to the list of packages or services through a pull request if you are aware of others: