RubyGems - webshaker - Versions diffs - 0.0.2 → 0.0.3 - Mend

webshaker 0.0.2 → 0.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d374df1a3ab5df427cb1708543b66b9dec579ec5e207572510d1a5e81a30354a
-  data.tar.gz: a463df17723b6fcf49856c12c90d268b55a140f6cf947f6ff36efa4cfde95860
+  metadata.gz: 4b03ca48d3495f70d5097d85325b4ab917b4ff0217b6634fcc50a9649e7ffc5e
+  data.tar.gz: 12925527140a41b0dd5e64a5551292a0fc97a7d42221bbe636d6517ec3d8d6ac
 SHA512:
-  metadata.gz: 9d1b9fcd446773d2b47ecd767534d4bcac371cdb01e6f55c56d211d6122aa9f8ee2ced4e80a539d125c5479eb141a307cbced70a43de7190cc42a4685ee21921
-  data.tar.gz: ed0c382f8a45f0e2cf125f35ccdb2ca3f4a2cbade022d947d0318232143cd4e11a5644ac0d03f43ea334adfae6cf9cdea08d9150f70ea7a254d4e00a6e44b0a5
+  metadata.gz: 2e2bc1973b04b9fec176b7cde9aa07908c29b83a9ec074d25d983e5f2b025b60b84ce6635483566325222868224f598d07ce9a8e7d4d778873104f920a0369d8
+  data.tar.gz: 88ac426fa0c12c9e2e2a45e2cf6ed23264402ef05c9be30f9cbd0b18c83caba46d38f26665197fb354161b7efb9fd692b8f0474928f990f8b613c0552f38ca6d
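
The values above are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of checking one of them locally, assuming `data.tar.gz` has already been extracted from the gem into the working directory; the helper name `checksum_matches?` is hypothetical and not part of RubyGems or webshaker:

```ruby
require "digest"

# Hypothetical helper: true when the file's SHA256 hex digest equals the
# expected value (for example, the one listed in checksums.yaml).
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Usage, assuming data.tar.gz from webshaker 0.0.3 sits next to this script:
#   checksum_matches?(
#     "data.tar.gz",
#     "12925527140a41b0dd5e64a5551292a0fc97a7d42221bbe636d6517ec3d8d6ac"
#   )
```

The same pattern would work for the SHA512 entries by swapping in `Digest::SHA512`.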

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    webshaker (0.0.2)
+    webshaker (0.0.3)
       nokogiri
       ruby-openai
       selenium-webdriver (~> 4.0)

data/README.md CHANGED

@@ -1,6 +1,11 @@
 # Webshaker
-An intelligent web scraper that uses Selenium WebDriver to scrape a URL and parse its content using AI.
+|  Tests |  Coverage  |
+|:-:|:-:|
+| [![Tests](https://github.com/mochetts/webshaker/actions/workflows/main.yml/badge.svg?branch=main)](https://github.com/mochetts/webshaker/actions/workflows/main.yml)  |  [![Codecov Coverage](https://codecov.io/github/mochetts/webshaker/graph/badge.svg?token=SKTT14JJGV)](https://codecov.io/github/mochetts/webshaker) |
+An intelligent web scraper that uses Selenium WebDriver to scrape a URL and parse it using AI.
 ## Installation
@@ -30,17 +35,81 @@ end
 ```ruby
 # Create a shaker out of a specific URL.
-# Make sure certain xpath is fulfilled before downloading the HTML content (to circumvent client dynamic hydration).
-# You can also wait for css: {wait_for: {css: ".some-class"}}
-# Or you can wait for some tag name: {wait_for: {tag_name: "body"}}
+shaker = Webshaker::Shaker.new("https://www.google.com")
+# Query anything about the website.
+result = shaker.shake(with_prompt: "What's this website about?")
+# => "This website appears to be the homepage of Google, specifically targeting users in Uruguay (as indicated by the reference to \"Uruguay\" and the Spanish language). It contains links to various Google services such as Gmail, Google Images, and a login page for Google accounts. Additionally, there are sections for user feedback, search functionalities, and links to Google policies and services. The layout includes buttons, forms, and elements for user interaction, typical of a search engine homepage."
+result = shaker.shake(with_prompt: "Give me a list of all the links", respond_with: :json)
+# =>
+#  {"links"=>
+#   ["https://mail.google.com/mail/&ogbl",
+#    "https://www.google.com/imghp?hl=es-419&ogbl",
+#    "https://accounts.google.com/ServiceLogin?hl=es-419&passive=true&continue=https://www.google.com/&ec=GAZAmgQ",
+#    "/search?sca_esv=08a7c6c574dce941&sca_upv=1&q=vela+olimpiadas&oi=ddle&ct=335645951&hl=es-419&sa=X&ved=0ahUKEwiBsrmL9taHAxWGq5UCHV1hEJkQPQgC",
+#    "https://about.google/?utm_source=google-UY&utm_medium=referral&utm_campaign=hp-footer&fg=1",
+#    "https://www.google.com/intl/es-419_uy/ads/?subid=ww-ww-et-g-awa-a-g_hpafoot1_1!o2&utm_source=google.com&utm_medium=referral&utm_campaign=google_hpafooter&fg=1",
+#    "https://www.google.com/services/?subid=ww-ww-et-g-awa-a-g_hpbfoot1_1!o2&utm_source=google.com&utm_medium=referral&utm_campaign=google_hpbfooter&fg=1",
+#    "https://google.com/search/howsearchworks/?fg=1",
+#    "https://policies.google.com/privacy?hl=es-419&fg=1",
+#    "https://policies.google.com/terms?hl=es-419&fg=1",
+#    "https://www.google.com/preferences?hl=es-419&fg=1",
+#    "/advanced_search?hl=es-419&fg=1",
+#    "/history/privacyadvisor/search/unauth?utm_source=googlemenu&fg=1&cctld=com",
+#    "/history/optout?hl=es-419&fg=1",
+#    "https://support.google.com/websearch/?p=ws_results_help&hl=es-419&fg=1"]}
+```
+### SPA dynamic hydration
+Sometimes the page we want to scrape is a Single Page Application. This means that the initial HTML returned by the server does not contain the content we want to scrape, only a skeleton that gets filled in by JavaScript.
+To work around this, you can tell Webshaker to wait for some HTML content to appear before it scrapes the page.
+For example, you can wait for a certain XPath to show up:
+```rb
 shaker = Webshaker::Shaker.new(
   "https://www.google.com",
   {wait_for: {xpath: "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[3]"}}
 )
-# Query anything about the website.
 result = shaker.shake(with_prompt: "What's this website about?")
-# => "This website appears to be the homepage of Google, specifically targeting users in Uruguay (as indicated by the reference to \"Uruguay\" and the Spanish language). It contains links to various Google services such as Gmail, Google Images, and a login page for Google accounts. Additionally, there are sections for user feedback, search functionalities, and links to Google policies and services. The layout includes buttons, forms, and elements for user interaction, typical of a search engine homepage."
+```
+You can also wait for a certain CSS selector to be found:
+```rb
+shaker = Webshaker::Shaker.new(
+  "https://www.google.com",
+  {wait_for: {css: ".some-class"}}
+)
+result = shaker.shake(with_prompt: "What's this website about?")
+```
+Or you can wait for some specific node to become available:
+```rb
+shaker = Webshaker::Shaker.new(
+  "https://www.google.com",
+  {wait_for: {tag_name: "body"}}
+)
+result = shaker.shake(with_prompt: "What's this website about?")
+```
+### Responding with JSON
+In some scenarios (e.g. when the communication is between two systems), it might be better to return JSON. To do so, add the `respond_with: :json` keyword argument to the `shake` call:
+```ruby
+# Create a shaker out of a specific URL.
+shaker = Webshaker::Shaker.new(
+  "https://www.google.com",
+  {wait_for: {xpath: "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[3]"}}
+)
 result = shaker.shake(with_prompt: "Give me a list of all the links", respond_with: :json)
 # =>
@@ -62,6 +131,20 @@ result = shaker.shake(with_prompt: "Give me a list of all the links", respond_wi
 #    "https://support.google.com/websearch/?p=ws_results_help&hl=es-419&fg=1"]}
 ```
+### Temperature
+Sometimes we need to adjust the sampling temperature to make the result more or less random. To do so, pass the `temperature` keyword to the `shake` call:
+```rb
+# Create a shaker out of a specific URL.
+shaker = Webshaker::Shaker.new(
+  "https://www.google.com",
+  {wait_for: {xpath: "/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[3]"}}
+)
+result = shaker.shake(with_prompt: "Give me a list of all the links", respond_with: :json, temperature: 0.2)
+```
 ## Development
 You can use `bin/console` to access an interactive console. This will preload environment variables from a `.env` file.

data/lib/webshaker/ai.rb ADDED

@@ -0,0 +1,42 @@
+module Webshaker
+  class Ai
+    attr_reader :html_content
+    def initialize(html_content)
+      @html_content = html_content
+    end
+    def analyze(with_prompt:, respond_with: :text, temperature: 0.8)
+      response = ai_client.chat(
+        parameters: {
+          model: Webshaker.config.model,
+          messages: messages(with_prompt).concat((respond_with.to_sym == :json) ? [{role: "user", content: "respond with json"}] : []),
+          temperature:
+        }.merge(
+          (respond_with == :json) ? {response_format: {type: "json_object"}} : {}
+        )
+      )
+      # Return full response from the ai client if the respond_with is set to :full
+      return response if respond_with === :full
+      response = response["choices"][0]["message"]["content"]
+      response = JSON.parse(response) if respond_with === :json
+      response
+    end
+    private
+    def ai_client
+      @client ||= OpenAI::Client.new(access_token: Webshaker.config.open_ai_key)
+    end
+    def messages(user_prompt)
+      [
+        {role: "system", content: "You are an HTML interpreter. The user will give you the contents of an HTML and ask you something about it. Please comply with what the user asks."},
+        {role: "user", content: html_content},
+        {role: "user", content: user_prompt}
+      ]
+    end
+  end
+end
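
The request assembly in `analyze` can be traced without calling OpenAI. The sketch below mirrors the two `respond_with` checks from the added file in a standalone method; `build_chat_parameters`, `SYSTEM_PROMPT`, and the `"example-model"` value are hypothetical stand-ins, and the system prompt is shortened. It suggests that passing the string `"json"` instead of the symbol `:json` would add the "respond with json" message but not the `response_format` option, since only the first check calls `to_sym`.

```ruby
# Shortened stand-in for the system prompt used in the gem.
SYSTEM_PROMPT = "You are an HTML interpreter."

# Hypothetical helper that builds the same parameters hash as Ai#analyze,
# using append/assignment in place of the diff's concat/merge.
def build_chat_parameters(model:, html_content:, prompt:, respond_with: :text, temperature: 0.8)
  messages = [
    {role: "system", content: SYSTEM_PROMPT},
    {role: "user", content: html_content},
    {role: "user", content: prompt}
  ]
  # First check from the diff: to_sym makes this true for :json and "json".
  messages << {role: "user", content: "respond with json"} if respond_with.to_sym == :json

  parameters = {model: model, messages: messages, temperature: temperature}
  # Second check from the diff: only the symbol :json matches here.
  parameters[:response_format] = {type: "json_object"} if respond_with == :json
  parameters
end

as_symbol = build_chat_parameters(model: "example-model", html_content: "<p>hi</p>", prompt: "Summarize", respond_with: :json)
as_string = build_chat_parameters(model: "example-model", html_content: "<p>hi</p>", prompt: "Summarize", respond_with: "json")

puts as_symbol.key?(:response_format) # true
puts as_string.key?(:response_format) # false
puts as_string[:messages].length      # 4
```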

data/lib/webshaker/shaker.rb CHANGED

@@ -10,11 +10,7 @@ module Webshaker
     end
     def shake(with_prompt:, respond_with: :text, temperature: 0.8)
-      ai_result(
-        with_prompt,
-        respond_with:,
-        temperature:
-      )
+      ai.analyze(with_prompt:, respond_with: :text, temperature: 0.8)
     end
     private
@@ -23,31 +19,8 @@ module Webshaker
       @html ||= Webshaker::Scraper.new(url, scrape_options).scrape.html
     end
-    def ai_result(user_prompt, respond_with: :text, temperature: 0.8)
-      response = ai_client.chat(
-        parameters: {
-          model: Webshaker.config.model,
-          messages: ai_prompt(user_prompt, html).concat((respond_with == :json) ? [{role: "user", content: "respond with json"}] : []),
-          temperature:
-        }.merge(
-          (respond_with == :json) ? {response_format: {type: "json_object"}} : {}
-        )
-      )
-      response = response["choices"][0]["message"]["content"]
-      response = JSON.parse(response) if respond_with === :json
-      response
-    end
-    def ai_client
-      @client ||= OpenAI::Client.new(access_token: Webshaker.config.open_ai_key)
-    end
-    def ai_prompt(user_prompt, html_content)
-      [
-        {role: "system", content: "You are an HTML interpreter. The user will give you the contents of an HTML and ask you something about it. Please comply with what the user asks."},
-        {role: "user", content: html_content},
-        {role: "user", content: user_prompt}
-      ]
+    def ai
+      @ai ||= Webshaker::Ai.new(html)
     end
   end
 end
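
Read literally, the new `shake` body passes the literals `:text` and `0.8` to `ai.analyze`, whereas the removed code forwarded the caller's `respond_with` and `temperature`. The diff alone does not say whether this is intentional. The sketch below contrasts the two call styles using a stand-in class; `RecordingAi`, `shake_with_literals`, and `shake_forwarding` are hypothetical names, not part of the gem.

```ruby
# Stand-in for Webshaker::Ai that echoes the keyword arguments it receives.
class RecordingAi
  def analyze(with_prompt:, respond_with: :text, temperature: 0.8)
    {with_prompt: with_prompt, respond_with: respond_with, temperature: temperature}
  end
end

# Same call shape as the 0.0.3 diff: literal values replace the caller's.
def shake_with_literals(ai, with_prompt:, respond_with: :text, temperature: 0.8)
  ai.analyze(with_prompt: with_prompt, respond_with: :text, temperature: 0.8)
end

# Possible alternative: forward the caller's keyword arguments unchanged.
def shake_forwarding(ai, with_prompt:, respond_with: :text, temperature: 0.8)
  ai.analyze(with_prompt: with_prompt, respond_with: respond_with, temperature: temperature)
end

ai = RecordingAi.new
literal = shake_with_literals(ai, with_prompt: "links?", respond_with: :json, temperature: 0.2)
forwarded = shake_forwarding(ai, with_prompt: "links?", respond_with: :json, temperature: 0.2)

puts literal[:respond_with].inspect   # :text
puts forwarded[:respond_with].inspect # :json
puts forwarded[:temperature]          # 0.2
```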

data/lib/webshaker/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Webshaker
-  VERSION = "0.0.2"
+  VERSION = "0.0.3"
 end

data/lib/webshaker.rb CHANGED

@@ -1,5 +1,6 @@
 require "webshaker/scrape_result"
 require "webshaker/scraper"
+require "webshaker/ai"
 require "webshaker/shaker"
 module Webshaker

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: webshaker
 version: !ruby/object:Gem::Version
-  version: 0.0.2
+  version: 0.0.3
 platform: ruby
 authors:
 - Martin Mochetti
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-08-02 00:00:00.000000000 Z
+date: 2024-08-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: selenium-webdriver
@@ -70,6 +70,7 @@ files:
 - README.md
 - Rakefile
 - lib/webshaker.rb
+- lib/webshaker/ai.rb
 - lib/webshaker/scrape_result.rb
 - lib/webshaker/scraper.rb
 - lib/webshaker/shaker.rb