ohmyscrapper 0.2.3__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  Metadata-Version: 2.3
  Name: ohmyscrapper
- Version: 0.2.3
- Summary: This project aims to create a text-based scraper containing links to create a final PDF with general information about job openings.
+ Version: 0.4.0
+ Summary: OhMyScrapper scrapes texts and urls looking for links and jobs-data to create a final report with general information about job positions.
  Author: Cesar Cardoso
  Author-email: Cesar Cardoso <hello@cesarcardoso.cc>
  Requires-Dist: beautifulsoup4>=4.14.3
@@ -16,16 +16,17 @@ Requires-Dist: urlextract>=1.9.0
  Requires-Python: >=3.11
  Description-Content-Type: text/markdown

- # OhMyScrapper - v0.2.3
+ # 🐶 OhMyScrapper - v0.4.0

- This project aims to create a text-based scraper containing links to create a
- final PDF with general information about job openings.
+ OhMyScrapper scrapes texts and urls looking for links and jobs-data to create a
+ final report with general information about job positions.

  ## Scope

  - Read texts;
- - Extract links;
- - Use meta og:tags to extract information;
+ - Extract and load urls;
+ - Scrapes the urls looking for og:tags and titles;
+ - Export a list of links with relevant information;

  ## Installation

@@ -50,7 +51,7 @@ uvx ohmyscrapper --version

  OhMyScrapper works in 3 stages:

- 1. It collects and loads urls from a text (by default `input/_chat.txt`) in a database;
+ 1. It collects and loads urls from a text in a database;
  2. It scraps/access the collected urls and read what is relevant. If it finds new urls, they are collected as well;
  3. Export a list of urls in CSV files;

@@ -58,7 +59,7 @@ You can do 3 stages with the command:
  ```shell
  ohmyscrapper start
  ```
- > Remember to add your text file in the folder `/input` with the name `_chat.txt`!
+ > Remember to add your text file in the folder `/input` with the name that finishes with `.txt`!

  You will find the exported files in the folder `/output` like this:
  - `/output/report.csv`
@@ -70,18 +71,23 @@ You will find the exported files in the folder `/output` like this:

  ### BUT: if you want to do step by step, here it is:

- First we load a text file you would like to look for urls, the idea here is to
- use the whatsapp history, but it works with any txt file.
+ First we load a text file you would like to look for urls. It it works with any txt file.

- The default file is `input/_chat.txt`. If you have the default file you just use
- the command `load`:
+ The default folder is `/input`. Put one or more text (finished with `.txt`) files
+ in this folder and use the command `load`:
  ```shell
  ohmyscrapper load
  ```
- or, if you have another file, just use the argument `-file` like this:
+ or, if you have another file in a different folder, just use the argument `-input` like this:
  ```shell
- ohmyscrapper load -file=my-text-file.txt
+ ohmyscrapper load -input=my-text-file.txt
  ```
+ In this case, you can add an url directly to the database, like this:
+ ```shell
+ ohmyscrapper load -input=https://cesarcardoso.cc/
+ ```
+ That will append the last url in the database to be scraped.
+
  That will create a database if it doesn't exist and store every url the oh-my-scrapper
  find. After that, let's scrap the urls with the command `scrap-urls`:

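Before the source-level hunks below, it may help to see what the `load` stage boils down to: pulling every url out of a blob of text with the urlextract dependency declared above. A minimal standalone sketch (not the package's code; the sample text is made up):

```python
from urlextract import URLExtract

def extract_urls(text: str) -> list[str]:
    # Stage 1 in miniature: find every url in arbitrary text (a chat export,
    # pasted notes, or a single url); the package stores each hit in SQLite.
    return URLExtract().find_urls(text)

print(extract_urls("see https://cesarcardoso.cc/ and https://lnkd.in/abc"))
```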
@@ -1,13 +1,14 @@
- # OhMyScrapper - v0.2.3
+ # 🐶 OhMyScrapper - v0.4.0

- This project aims to create a text-based scraper containing links to create a
- final PDF with general information about job openings.
+ OhMyScrapper scrapes texts and urls looking for links and jobs-data to create a
+ final report with general information about job positions.

  ## Scope

  - Read texts;
- - Extract links;
- - Use meta og:tags to extract information;
+ - Extract and load urls;
+ - Scrapes the urls looking for og:tags and titles;
+ - Export a list of links with relevant information;

  ## Installation

@@ -32,7 +33,7 @@ uvx ohmyscrapper --version

  OhMyScrapper works in 3 stages:

- 1. It collects and loads urls from a text (by default `input/_chat.txt`) in a database;
+ 1. It collects and loads urls from a text in a database;
  2. It scraps/access the collected urls and read what is relevant. If it finds new urls, they are collected as well;
  3. Export a list of urls in CSV files;

@@ -40,7 +41,7 @@ You can do 3 stages with the command:
  ```shell
  ohmyscrapper start
  ```
- > Remember to add your text file in the folder `/input` with the name `_chat.txt`!
+ > Remember to add your text file in the folder `/input` with the name that finishes with `.txt`!

  You will find the exported files in the folder `/output` like this:
  - `/output/report.csv`
@@ -52,18 +53,23 @@ You will find the exported files in the folder `/output` like this:

  ### BUT: if you want to do step by step, here it is:

- First we load a text file you would like to look for urls, the idea here is to
- use the whatsapp history, but it works with any txt file.
+ First we load a text file you would like to look for urls. It it works with any txt file.

- The default file is `input/_chat.txt`. If you have the default file you just use
- the command `load`:
+ The default folder is `/input`. Put one or more text (finished with `.txt`) files
+ in this folder and use the command `load`:
  ```shell
  ohmyscrapper load
  ```
- or, if you have another file, just use the argument `-file` like this:
+ or, if you have another file in a different folder, just use the argument `-input` like this:
  ```shell
- ohmyscrapper load -file=my-text-file.txt
+ ohmyscrapper load -input=my-text-file.txt
  ```
+ In this case, you can add an url directly to the database, like this:
+ ```shell
+ ohmyscrapper load -input=https://cesarcardoso.cc/
+ ```
+ That will append the last url in the database to be scraped.
+
  That will create a database if it doesn't exist and store every url the oh-my-scrapper
  find. After that, let's scrap the urls with the command `scrap-urls`:

@@ -1,7 +1,7 @@
  [project]
  name = "ohmyscrapper"
- version = "0.2.3"
- description = "This project aims to create a text-based scraper containing links to create a final PDF with general information about job openings."
+ version = "0.4.0"
+ description = "OhMyScrapper scrapes texts and urls looking for links and jobs-data to create a final report with general information about job positions."
  readme = "README.md"
  authors = [
  { name = "Cesar Cardoso", email = "hello@cesarcardoso.cc" }
@@ -19,20 +19,22 @@ from ohmyscrapper.modules.merge_dbs import merge_dbs

  def main():
  parser = argparse.ArgumentParser(prog="ohmyscrapper")
- parser.add_argument("--version", action="version", version="%(prog)s v0.2.3")
+ parser.add_argument("--version", action="version", version="%(prog)s v0.4.0")

  subparsers = parser.add_subparsers(dest="command", help="Available commands")
  start_parser = subparsers.add_parser(
- "start", help="Make the entire process of loading, processing and exporting with the default configuration."
+ "start",
+ help="Make the entire process of 📦 loading, 🐶 scraping and 📜🖋️ exporting with the default configuration.",
  )

  start_parser.add_argument(
- "--ai", default=False, help="Make the entire process of loading, processing, reprocessing with AI and exporting with the default configuration.", action="store_true"
+ "--ai",
+ default=False,
+ help="Make the entire process of loading, processing, reprocessing with AI and exporting with the default configuration.",
+ action="store_true",
  )

- ai_process_parser = subparsers.add_parser(
- "ai", help="Process with AI."
- )
+ ai_process_parser = subparsers.add_parser("ai", help="Process with AI.")
  ai_process_parser.add_argument(
  "--history", default=False, help="Reprocess ai history", action="store_true"
  )
@@ -51,12 +53,13 @@ def main():
  "--recursive", default=False, help="Run in recursive mode", action="store_true"
  )

- load_txt_parser = subparsers.add_parser("load", help="Load txt file")
+ load_txt_parser = subparsers.add_parser("load", help="📦 Load txt file")
+ load_txt_parser.add_argument("-input", default=None, help="File path or url.")
  load_txt_parser.add_argument(
- "-file", default="input/_chat.txt", help="File path. Default is input/_chat.txt"
+ "--verbose", default=False, help="Run in verbose mode", action="store_true"
  )

- scrap_urls_parser = subparsers.add_parser("scrap-urls", help="Scrap urls")
+ scrap_urls_parser = subparsers.add_parser("scrap-urls", help="🐶 Scrap urls")
  scrap_urls_parser.add_argument(
  "--recursive", default=False, help="Run in recursive mode", action="store_true"
  )
@@ -69,8 +72,11 @@ def main():
  scrap_urls_parser.add_argument(
  "--only-parents", default=False, help="Only parents urls", action="store_true"
  )
+ scrap_urls_parser.add_argument(
+ "--verbose", default=False, help="Run in verbose mode", action="store_true"
+ )

- sniff_url_parser = subparsers.add_parser("sniff-url", help="Check url")
+ sniff_url_parser = subparsers.add_parser("sniff-url", help="🐕 Sniff/Check url")
  sniff_url_parser.add_argument(
  "url", default="https://cesarcardoso.cc/", help="Url to sniff"
  )
@@ -82,7 +88,7 @@ def main():
  show_urls_parser.add_argument("--limit", default=0, help="Limit of lines to show")
  show_urls_parser.add_argument("-url", default="", help="Url to show")

- export_parser = subparsers.add_parser("export", help="Export urls to csv.")
+ export_parser = subparsers.add_parser("export", help="📊🖋️ Export urls to csv.")
  export_parser.add_argument("--limit", default=0, help="Limit of lines to export")
  export_parser.add_argument(
  "--file",
@@ -96,14 +102,11 @@ def main():
  action="store_true",
  )

- report_parser = subparsers.add_parser("report", help="Export urls report to csv.")
+ report_parser = subparsers.add_parser(
+ "report", help="📜🖋️ Export urls report to csv."
+ )
  merge_parser = subparsers.add_parser("merge_dbs", help="Merge databases.")

- # TODO: What is that?
- # seed_parser.set_defaults(func=seed)
- # classify_urls_parser.set_defaults(func=classify_urls)
- # load_txt_parser.set_defaults(func=load_txt)
-
  args = parser.parse_args()

  if args.command == "classify-urls":
@@ -111,7 +114,7 @@ def main():
  return

  if args.command == "load":
- load_txt(args.file)
+ load_txt(file_name=args.input, verbose=args.verbose)
  return

  if args.command == "seed":
@@ -132,6 +135,7 @@ def main():
  ignore_valid_prefix=args.ignore_type,
  randomize=args.randomize,
  only_parents=args.only_parents,
+ verbose=args.verbose,
  )
  return

@@ -166,7 +170,12 @@ def main():

  if args.command == "start":
  load_txt()
- scrap_urls(recursive=True,ignore_valid_prefix=True,randomize=False,only_parents=False)
+ scrap_urls(
+ recursive=True,
+ ignore_valid_prefix=True,
+ randomize=False,
+ only_parents=False,
+ )
  if args.ai:
  process_with_ai()
  export_urls()
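The CLI hunks above all follow the same argparse subcommand pattern. A compressed sketch of just the new `load` interface and its dispatch (only this one path is reproduced; the sample arguments come from the README):

```python
import argparse

parser = argparse.ArgumentParser(prog="ohmyscrapper")
parser.add_argument("--version", action="version", version="%(prog)s v0.4.0")
subparsers = parser.add_subparsers(dest="command", help="Available commands")

load_parser = subparsers.add_parser("load", help="📦 Load txt file")
load_parser.add_argument("-input", default=None, help="File path or url.")
load_parser.add_argument("--verbose", default=False, action="store_true")

# e.g. `ohmyscrapper load -input=my-text-file.txt --verbose`
args = parser.parse_args(["load", "-input=my-text-file.txt", "--verbose"])
if args.command == "load":
    # the real CLI calls load_txt(file_name=args.input, verbose=args.verbose)
    print(args.input, args.verbose)
```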
@@ -17,14 +17,21 @@ def get_db_path():


  def get_db_connection():
+ if not os.path.exists(get_db_path()):
+ create_tables(sqlite3.connect(get_db_path()))
  return sqlite3.connect(get_db_path())


- # TODO: check if it makes sense
- conn = get_db_connection()
+ def use_connection(func):
+ def provide_connection(*args, **kwargs):
+ global conn
+ with get_db_connection() as conn:
+ return func(*args, **kwargs)

+ return provide_connection

- def create_tables():
+
+ def create_tables(conn):

  c = conn.cursor()
  c.execute(
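The hunk above swaps the old module-level `conn = get_db_connection()` for a `use_connection` decorator that opens a connection per call and exposes it through a module-level name. A self-contained sketch of that pattern (the path and the sample query are illustrative):

```python
import sqlite3

DB_PATH = "db/ohmyscrapper.sqlite"  # illustrative path, not necessarily the package's

def get_db_connection():
    return sqlite3.connect(DB_PATH)

def use_connection(func):
    # Open a connection for each call and publish it as the module-level
    # `conn` that the wrapped function reads; the `with` block commits on
    # success and rolls back on error.
    def provide_connection(*args, **kwargs):
        global conn
        with get_db_connection() as conn:
            return func(*args, **kwargs)
    return provide_connection

@use_connection
def count_urls():
    return conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
```

Worth noting: sqlite3's context manager scopes the transaction, not the connection itself, so connections opened this way are never explicitly closed.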
@@ -38,27 +45,19 @@ def create_tables():
  "CREATE TABLE IF NOT EXISTS urls_valid_prefix (id INTEGER PRIMARY KEY, url_prefix TEXT UNIQUE, url_type TEXT)"
  )

- return pd.read_sql_query("SELECT * FROM urls LIMIT 100", conn)
-

- # TODO: not sure this should be something. depends on the project
  def seeds():
- create_tables()
-
  add_urls_valid_prefix("https://%.linkedin.com/posts/%", "linkedin_post")
  add_urls_valid_prefix("https://lnkd.in/%", "linkedin_redirect")
  add_urls_valid_prefix("https://%.linkedin.com/jobs/view/%", "linkedin_job")
  add_urls_valid_prefix("https://%.linkedin.com/feed/%", "linkedin_feed")
  add_urls_valid_prefix("https://%.linkedin.com/company/%", "linkedin_company")

- # add_urls_valid_prefix("%.pdf", "pdf")
- # add_url('https://imazon.org.br/categorias/artigos-cientificos/')
-
  return True


+ @use_connection
  def add_urls_valid_prefix(url_prefix, url_type):
- conn = get_db_connection()

  df = pd.read_sql_query(
  f"SELECT * FROM urls_valid_prefix WHERE url_prefix = '{url_prefix}'", conn
@@ -72,6 +71,7 @@ def add_urls_valid_prefix(url_prefix, url_type):
  conn.commit()


+ @use_connection
  def get_urls_valid_prefix_by_type(url_type):
  df = pd.read_sql_query(
  f"SELECT * FROM urls_valid_prefix WHERE url_type = '{url_type}'", conn
@@ -79,12 +79,14 @@ def get_urls_valid_prefix_by_type(url_type):
  return df


+ @use_connection
  def get_urls_valid_prefix_by_id(id):
  df = pd.read_sql_query(f"SELECT * FROM urls_valid_prefix WHERE id = '{id}'", conn)
  return df


  # TODO: pagination required
+ @use_connection
  def get_urls_valid_prefix(limit=0):
  if limit > 0:
  df = pd.read_sql_query(f"SELECT * FROM urls_valid_prefix LIMIT {limit}", conn)
@@ -94,6 +96,7 @@ def get_urls_valid_prefix(limit=0):


  # TODO: pagination required
+ @use_connection
  def get_urls(limit=0):
  if limit > 0:
  df = pd.read_sql_query(
@@ -104,6 +107,7 @@ def get_urls(limit=0):
  return df


+ @use_connection
  def get_urls_report():
  sql = """
  WITH parent_url AS (
@@ -138,6 +142,7 @@ def get_urls_report():
  return df


+ @use_connection
  def get_url_by_url(url):
  url = clean_url(url)
  df = pd.read_sql_query(f"SELECT * FROM urls WHERE url = '{url}'", conn)
@@ -145,12 +150,14 @@ def get_url_by_url(url):
  return df


+ @use_connection
  def get_url_by_id(id):
  df = pd.read_sql_query(f"SELECT * FROM urls WHERE id = '{id}'", conn)

  return df


+ @use_connection
  def get_urls_by_url_type(url_type):
  df = pd.read_sql_query(
  f"SELECT * FROM urls WHERE history = 0 AND url_type = '{url_type}'", conn
@@ -158,6 +165,7 @@ def get_urls_by_url_type(url_type):
  return df


+ @use_connection
  def get_urls_by_url_type_for_ai_process(url_type="linkedin_post", limit=10):
  df = pd.read_sql_query(
  f"SELECT * FROM urls WHERE history = 0 AND url_type = '{url_type}' AND ai_processed = 0 LIMIT {limit}",
@@ -166,6 +174,7 @@ def get_urls_by_url_type_for_ai_process(url_type="linkedin_post", limit=10):
  return df


+ @use_connection
  def get_url_like_unclassified(like_condition):
  df = pd.read_sql_query(
  f"SELECT * FROM urls WHERE history = 0 AND url LIKE '{like_condition}' AND url_type IS NULL",
@@ -174,6 +183,7 @@ def get_url_like_unclassified(like_condition):
  return df


+ @use_connection
  def add_url(url, h1=None, parent_url=None):
  url = clean_url(url)
  c = conn.cursor()
@@ -196,6 +206,7 @@ def add_url(url, h1=None, parent_url=None):
  return get_url_by_url(url)


+ @use_connection
  def add_ai_log(instructions, response, model, prompt_file, prompt_name):
  c = conn.cursor()

@@ -205,10 +216,14 @@ def add_ai_log(instructions, response, model, prompt_file, prompt_name):
  )
  conn.commit()

+
+ @use_connection
  def get_ai_log():
  df = pd.read_sql_query(f"SELECT * FROM ai_log", conn)
  return df

+
+ @use_connection
  def set_url_destiny(url, destiny):
  url = clean_url(url)
  destiny = clean_url(destiny)
@@ -222,6 +237,7 @@ def set_url_destiny(url, destiny):
  conn.commit()


+ @use_connection
  def set_url_h1(url, value):
  value = str(value).strip()
  url = clean_url(url)
@@ -230,6 +246,7 @@ def set_url_h1(url, value):
  conn.commit()


+ @use_connection
  def set_url_h1_by_id(id, value):
  value = str(value).strip()

@@ -238,29 +255,44 @@ def set_url_h1_by_id(id, value):
  conn.commit()


+ @use_connection
  def set_url_ai_processed_by_id(id, json_str):
  value = 1
  value = str(value).strip()
  c = conn.cursor()
- c.execute("UPDATE urls SET ai_processed = ? , json_ai = ? WHERE id = ?", (value, json_str, id))
+ c.execute(
+ "UPDATE urls SET ai_processed = ? , json_ai = ? WHERE id = ?",
+ (value, json_str, id),
+ )
  conn.commit()

+
+ @use_connection
  def set_url_empty_ai_processed_by_id(id, json_str="empty result"):
  value = 1
  value = str(value).strip()
  c = conn.cursor()
- c.execute("UPDATE urls SET ai_processed = ? , json_ai = ? WHERE ai_processed = 0 AND id = ?", (value, json_str, id))
+ c.execute(
+ "UPDATE urls SET ai_processed = ? , json_ai = ? WHERE ai_processed = 0 AND id = ?",
+ (value, json_str, id),
+ )
  conn.commit()

+
+ @use_connection
  def set_url_ai_processed_by_url(url, json_str):
  value = 1
  value = str(value).strip()
  url = clean_url(url)
  c = conn.cursor()
- c.execute("UPDATE urls SET ai_processed = ?, json_ai = ? WHERE url = ?", (value, json_str, url))
+ c.execute(
+ "UPDATE urls SET ai_processed = ?, json_ai = ? WHERE url = ?",
+ (value, json_str, url),
+ )
  conn.commit()


+ @use_connection
  def set_url_description(url, value):
  url = clean_url(url)
  c = conn.cursor()
@@ -268,6 +300,7 @@ def set_url_description(url, value):
  conn.commit()


+ @use_connection
  def set_url_description_links(url, value):
  url = clean_url(url)
  c = conn.cursor()
@@ -275,6 +308,7 @@ def set_url_description_links(url, value):
  conn.commit()


+ @use_connection
  def set_url_json(url, value):
  url = clean_url(url)
  c = conn.cursor()
@@ -282,6 +316,7 @@ def set_url_json(url, value):
  conn.commit()


+ @use_connection
  def set_url_error(url, value):
  url = clean_url(url)
  c = conn.cursor()
@@ -289,6 +324,7 @@ def set_url_error(url, value):
  conn.commit()


+ @use_connection
  def set_url_type_by_id(url_id, url_type):
  c = conn.cursor()
  c.execute(f"UPDATE urls SET url_type = '{url_type}' WHERE id = {url_id}")
@@ -312,6 +348,7 @@ def clean_url(url):
  return url


+ @use_connection
  def get_untouched_urls(
  limit=10, randomize=True, ignore_valid_prefix=False, only_parents=True
  ):
@@ -331,6 +368,7 @@ def get_untouched_urls(
  return df


+ @use_connection
  def touch_url(url):
  url = clean_url(url)
  c = conn.cursor()
@@ -338,6 +376,7 @@ def touch_url(url):
  conn.commit()


+ @use_connection
  def untouch_url(url):
  url = clean_url(url)
  c = conn.cursor()
@@ -345,12 +384,14 @@ def untouch_url(url):
  conn.commit()


+ @use_connection
  def untouch_all_urls():
  c = conn.cursor()
  c.execute("UPDATE urls SET last_touch = NULL WHERE history = 0")
  conn.commit()


+ @use_connection
  def set_all_urls_as_history():
  c = conn.cursor()
  c.execute("UPDATE urls SET history = 1")
@@ -382,9 +423,9 @@ def merge_dbs() -> None:
  row["description"],
  row["json"],
  )
- # ßmerge_url(df)


+ @use_connection
  def merge_url(url, h1, last_touch, created_at, description, json):
  url = clean_url(url)
  c = conn.cursor()
@@ -0,0 +1,27 @@
+ import ohmyscrapper.models.urls_manager as urls_manager
+ import pandas as pd
+ import time
+
+
+ def classify_urls(recursive=False):
+ urls_manager.seeds()
+ df = urls_manager.get_urls_valid_prefix()
+
+ keep_alive = True
+ while keep_alive:
+ print("#️⃣ URL Classifier woke up to classify urls!")
+ for index, row_prefix in df.iterrows():
+ df_urls = urls_manager.get_url_like_unclassified(
+ like_condition=row_prefix["url_prefix"]
+ )
+ for index, row_urls in df_urls.iterrows():
+ urls_manager.set_url_type_by_id(
+ url_id=row_urls["id"], url_type=row_prefix["url_type"]
+ )
+
+ if not recursive:
+ print("#️⃣ URL Classifier said: I'm done! See you soon...")
+ keep_alive = False
+ else:
+ print("#️⃣ URL Classifier is taking a nap...")
+ time.sleep(10)
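The classifier added above tags rows by matching each seeded prefix against the `urls` table with SQL `LIKE`. A toy, self-contained illustration of that mechanism (in-memory database, sample rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, url_type TEXT)")
conn.execute("CREATE TABLE urls_valid_prefix (url_prefix TEXT, url_type TEXT)")
conn.executemany("INSERT INTO urls (url) VALUES (?)", [
    ("https://www.linkedin.com/jobs/view/123",),
    ("https://example.com/blog",),
])
conn.execute("INSERT INTO urls_valid_prefix VALUES (?, ?)",
             ("https://%.linkedin.com/jobs/view/%", "linkedin_job"))

# Same idea as classify_urls(): every unclassified url that matches a known
# prefix pattern gets that prefix's url_type.
for prefix, url_type in conn.execute(
    "SELECT url_prefix, url_type FROM urls_valid_prefix"
).fetchall():
    conn.execute(
        "UPDATE urls SET url_type = ? WHERE url_type IS NULL AND url LIKE ?",
        (url_type, prefix),
    )

print(conn.execute("SELECT url, url_type FROM urls").fetchall())
```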
@@ -0,0 +1,98 @@
+ import os
+ from urlextract import URLExtract
+ import ohmyscrapper.models.urls_manager as urls_manager
+
+
+ def _increment_file_name(text_file_content, file_name):
+ print(f"reading and loading file `{file_name}`... ")
+ with open(file_name, "r") as f:
+ return text_file_content + f.read()
+
+
+ def load_txt(file_name=None, verbose=False):
+ if not os.path.exists("db"):
+ os.mkdir("db")
+
+ if not os.path.exists("input"):
+ os.mkdir("input")
+
+ urls_manager.seeds()
+
+ text_file_content = ""
+ if file_name is not None:
+ print(f"📖 reading file `{file_name}`... ")
+ if not os.path.exists(file_name):
+ if file_name.startswith("https://") or file_name.startswith("http://"):
+ text_file_content = " " + file_name + " "
+ else:
+ print(f"\n file `{file_name}` not found.")
+ return
+ else:
+ text_file_content = _increment_file_name(
+ text_file_content=text_file_content, file_name=file_name
+ )
+ else:
+ print("📂 reading /input directory... ")
+ dir_files = "input"
+ text_files = os.listdir(dir_files)
+ for file in text_files:
+ if not file.endswith(".txt"):
+ text_files.remove(file)
+ if len(text_files) == 0:
+ print("No text files found in /input directory!")
+ return
+ elif len(text_files) == 1:
+ print(f"📖 reading file `{dir_files}/{text_files[0]}`... ")
+ text_file_content = _increment_file_name(
+ text_file_content=text_file_content,
+ file_name=dir_files + "/" + text_files[0],
+ )
+ else:
+ print("\nChoose a text file. Use `*` for process all and `q` to quit:")
+ for index, file in enumerate(text_files):
+ print(f"[{index}]:", dir_files + "/" + file)
+
+ # TODO: there is a better way for sure!
+ text_file_option = -1
+ while text_file_option < 0 or text_file_option >= len(text_files):
+ text_file_option = input("Enter the file number: ")
+ if text_file_option == "*":
+ for file in text_files:
+ text_file_content = _increment_file_name(
+ text_file_content=text_file_content,
+ file_name=dir_files + "/" + file,
+ )
+ text_file_option = 0
+ elif text_file_option == "q":
+ return
+ elif text_file_option.isdigit():
+ text_file_option = int(text_file_option)
+ if text_file_option >= 0 and text_file_option < len(text_files):
+ text_file_content = _increment_file_name(
+ text_file_content=text_file_content,
+ file_name=dir_files
+ + "/"
+ + text_files[int(text_file_option)],
+ )
+
+ print("🔎 looking for urls...")
+ urls_found = put_urls_from_string(
+ text_to_process=text_file_content, verbose=verbose
+ )
+
+ print("--------------------")
+ print("files processed")
+ print(f"📦 {urls_found} urls were extracted and packed into the database")
+
+
+ def put_urls_from_string(text_to_process, parent_url=None, verbose=False):
+ if isinstance(text_to_process, str):
+ extractor = URLExtract()
+ for url in extractor.find_urls(text_to_process):
+ urls_manager.add_url(url=url, parent_url=parent_url)
+ if verbose:
+ print(url, "added")
+
+ return len(extractor.find_urls(text_to_process))
+ else:
+ return 0
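One detail of the new `load_txt` worth calling out: the `-input` argument is interpreted as either a path or a url, so an existing file is read while a bare http(s) string is treated as the text to scan. The dispatch reduced to a sketch (error handling collapsed into an exception):

```python
import os

def resolve_input(arg: str) -> str:
    # Mirrors load_txt(): read the file if it exists, otherwise accept a
    # bare url as the text itself, otherwise give up.
    if os.path.exists(arg):
        with open(arg, "r") as f:
            return f.read()
    if arg.startswith(("https://", "http://")):
        return f" {arg} "
    raise FileNotFoundError(arg)
```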
@@ -7,9 +7,11 @@ import time
  import os
  import yaml
  import json
+
  # TODO: !!! REFACTOR !!!
  load_dotenv()

+
  def reprocess_ai_history():
  df = urls_manager.get_ai_log().to_dict(orient="records")
  for row in df:
@@ -17,28 +19,34 @@ def reprocess_ai_history():


  def process_ai_response(response):
- job_positions = xml2dict(response)
-
- for index, xml_item_children in job_positions.items():
- for url_child_xml in xml_item_children:
-
- url_parent = urls_manager.get_url_by_id(url_child_xml["id"])
- if len(url_parent) > 0:
- url_parent = url_parent.iloc[0]
- h1 = url_child_xml.copy()
- del h1["id"]
- del h1["url"]
- h1 = " - ".join(h1.values())
- if url_parent["description_links"] > 1 and url_child_xml["id"] != "":
- print("-- child updated -- \n", url_child_xml["url"] , ":", h1)
- urls_manager.set_url_h1(url_child_xml["url"], h1)
- urls_manager.set_url_ai_processed_by_url(url_child_xml["url"], str(json.dumps(url_child_xml)))
- if url_parent["url"] != url_child_xml["url"]:
- urls_manager.set_url_ai_processed_by_url(url_parent["url"], "children-update")
- else:
- print("-- parent updated -- \n", url_parent["url"], ":", h1)
- urls_manager.set_url_h1(url_parent["url"], h1)
- urls_manager.set_url_ai_processed_by_url(url_parent["url"], str(json.dumps(url_child_xml)))
+ job_positions = xml2dict(response)
+
+ for index, xml_item_children in job_positions.items():
+ for url_child_xml in xml_item_children:
+
+ url_parent = urls_manager.get_url_by_id(url_child_xml["id"])
+ if len(url_parent) > 0:
+ url_parent = url_parent.iloc[0]
+ h1 = url_child_xml.copy()
+ del h1["id"]
+ del h1["url"]
+ h1 = " - ".join(h1.values())
+ if url_parent["description_links"] > 1 and url_child_xml["id"] != "":
+ print("-- child updated -- \n", url_child_xml["url"], ":", h1)
+ urls_manager.set_url_h1(url_child_xml["url"], h1)
+ urls_manager.set_url_ai_processed_by_url(
+ url_child_xml["url"], str(json.dumps(url_child_xml))
+ )
+ if url_parent["url"] != url_child_xml["url"]:
+ urls_manager.set_url_ai_processed_by_url(
+ url_parent["url"], "children-update"
+ )
+ else:
+ print("-- parent updated -- \n", url_parent["url"], ":", h1)
+ urls_manager.set_url_h1(url_parent["url"], h1)
+ urls_manager.set_url_ai_processed_by_url(
+ url_parent["url"], str(json.dumps(url_child_xml))
+ )


  def xml2dict(xml_string):
@@ -46,19 +54,21 @@ def xml2dict(xml_string):

  children_items_dict = {}
  for item in soup.find_all():
- if(item.parent.name == "[document]"):
+ if item.parent.name == "[document]":
  children_items_dict[item.name] = []
  elif item.parent.name in children_items_dict:
  children_items_dict[item.parent.name].append(_xml_children_to_dict(item))

  return children_items_dict

+
  def _xml_children_to_dict(xml):
  item_dict = {}
  for item in xml.find_all():
  item_dict[item.name] = item.text
  return item_dict

+
  def process_with_ai(recursive=True, triggered_times=0):
  triggered_times = triggered_times + 1

@@ -91,13 +101,23 @@ def process_with_ai(recursive=True, triggered_times=0):
  print("prompt:", prompt["name"])
  print("model:", prompt["model"])
  print("description:", prompt["description"])
- prompt["instructions"] = prompt["instructions"].replace("{ohmyscrapper_texts}", texts)
+ prompt["instructions"] = prompt["instructions"].replace(
+ "{ohmyscrapper_texts}", texts
+ )

  # The client gets the API key from the environment variable `GEMINI_API_KEY`.
  client = genai.Client()
- response = client.models.generate_content(model=prompt["model"], contents=prompt["instructions"])
+ response = client.models.generate_content(
+ model=prompt["model"], contents=prompt["instructions"]
+ )
  response = str(response.text)
- urls_manager.add_ai_log(instructions=prompt["instructions"], response=response, model=prompt["model"], prompt_name=prompt["name"], prompt_file=prompt["prompt_file"])
+ urls_manager.add_ai_log(
+ instructions=prompt["instructions"],
+ response=response,
+ model=prompt["model"],
+ prompt_name=prompt["name"],
+ prompt_file=prompt["prompt_file"],
+ )
  print(response)
  print("^^^^^^")
  process_ai_response(response=response)
@@ -114,7 +134,9 @@ def process_with_ai(recursive=True, triggered_times=0):
  if triggered_times > 5:
  print("!!! This is a break to prevent budget accident$.")
  print("You triggered", triggered_times, "times the AI processing function.")
- print("If you are sure this is correct, you can re-call this function again.")
+ print(
+ "If you are sure this is correct, you can re-call this function again."
+ )
  print("Please, check it.")
  return

@@ -122,6 +144,7 @@ def process_with_ai(recursive=True, triggered_times=0):

  return

+
  def _get_prompt():
  prompts_path = "prompts"
  default_prompt = """---
@@ -135,13 +158,17 @@ Process with AI this prompt: {ohmyscrapper_texts}
  os.mkdir(prompts_path)

  open(f"{prompts_path}/prompt.md", "w").write(default_prompt)
- print(f"You didn't have a prompt file. One was created in the /{prompts_path} folder. You can change it there.")
+ print(
+ f"You didn't have a prompt file. One was created in the /{prompts_path} folder. You can change it there."
+ )
  return False

  prompt_files = os.listdir(prompts_path)
  if len(prompt_files) == 0:
  open(f"{prompts_path}/prompt.md", "w").write(default_prompt)
- print(f"You didn't have a prompt file. One was created in the /{prompts_path} folder. You can change it there.")
+ print(
+ f"You didn't have a prompt file. One was created in the /{prompts_path} folder. You can change it there."
+ )
  return False
  prompt = {}
  if len(prompt_files) == 1:
@@ -151,8 +178,10 @@ Process with AI this prompt: {ohmyscrapper_texts}
  prompts = {}
  for index, file in enumerate(prompt_files):
  prompts[index] = _parse_prompt(prompts_path=prompts_path, prompt_file=file)
- print(index, ":", prompts[index]['name'])
- input_prompt = input("Type the number of the prompt you want to use or 'q' to quit: ")
+ print(index, ":", prompts[index]["name"])
+ input_prompt = input(
+ "Type the number of the prompt you want to use or 'q' to quit: "
+ )
  if input_prompt == "q":
  return False
  try:
@@ -162,6 +191,7 @@ Process with AI this prompt: {ohmyscrapper_texts}
  prompt = _get_prompt()
  return prompt

+
  def _parse_prompt(prompts_path, prompt_file):
  prompt = {}
  raw_prompt = open(f"{prompts_path}/{prompt_file}", "r").read().split("---")
@@ -170,6 +200,8 @@ def _parse_prompt(prompts_path, prompt_file):
  prompt["prompt_file"] = prompt_file

  return prompt
+
+
  # TODO: Separate gemini from basic function
  def _process_with_gemini(model, instructions):
  response = """"""
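For context on `_get_prompt` and `_parse_prompt` above: prompt files are markdown documents whose `---`-delimited header carries fields like `name`, `model` and `description`, and whose body contains the `{ohmyscrapper_texts}` placeholder that is replaced before the Gemini call. A hedged illustration of parsing that layout (this is not the package's `_parse_prompt`; the model value is a placeholder):

```python
import yaml

def parse_prompt_text(raw: str) -> dict:
    # Split off the YAML front matter, keep the rest as the instructions body.
    _, header, body = raw.split("---", 2)
    prompt = yaml.safe_load(header) or {}
    prompt["instructions"] = body.strip()
    return prompt

example = """---
name: default
model: gemini-placeholder
description: extract job data from scraped texts
---
Process with AI this prompt: {ohmyscrapper_texts}
"""
print(parse_prompt_text(example)["name"])
```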
@@ -7,72 +7,87 @@ import time
  import random


- def process_linkedin_redirect(url_report, url):
- print("linkedin_redirect")
+ def process_linkedin_redirect(url_report, url, verbose=False):
+ if verbose:
+ print("linkedin_redirect")

  if url_report["total-a-links"] < 5:
  if "first-a-link" in url_report.keys():
  url_destiny = url_report["first-a-link"]
  else:
  urls_manager.set_url_error(url=url["url"], value="error: no first-a-link")
- print("no url for:", url["url"])
+ if verbose:
+ print("no url for:", url["url"])
  return
  else:
  if "og:url" in url_report.keys():
  url_destiny = url_report["og:url"]
  else:
  urls_manager.set_url_error(url=url["url"], value="error: no og:url")
- print("no url for:", url["url"])
+ if verbose:
+ print("no url for:", url["url"])
  return
-
- print(url["url"], ">>", url_destiny)
+ if verbose:
+ print(url["url"], ">>", url_destiny)
  urls_manager.add_url(url=url_destiny)
  urls_manager.set_url_destiny(url=url["url"], destiny=url_destiny)


- def process_linkedin_feed(url_report, url):
- print("linkedin_feed")
+ def process_linkedin_feed(url_report, url, verbose=False):
+ if verbose:
+ print("linkedin_feed")

  if "og:url" in url_report.keys():
  url_destiny = url_report["og:url"]
  else:
  urls_manager.set_url_error(url=url["url"], value="error: no og:url")
- print("no url for:", url["url"])
+ if verbose:
+ print("no url for:", url["url"])
  return

- print(url["url"], ">>", url_destiny)
+ if verbose:
+ print(url["url"], ">>", url_destiny)
  urls_manager.add_url(url=url_destiny)
  urls_manager.set_url_destiny(url=url["url"], destiny=url_destiny)


- def process_linkedin_job(url_report, url):
- print("linkedin_job")
+ def process_linkedin_job(url_report, url, verbose=False):
+ if verbose:
+ print("linkedin_job")
  changed = False
  if "h1" in url_report.keys():
- print(url["url"], ": ", url_report["h1"])
+ if verbose:
+ print(url["url"], ": ", url_report["h1"])
  urls_manager.set_url_h1(url=url["url"], value=url_report["h1"])
  changed = True
  elif "og:title" in url_report.keys():
- print(url["url"], ": ", url_report["og:title"])
+ if verbose:
+ print(url["url"], ": ", url_report["og:title"])
  urls_manager.set_url_h1(url=url["url"], value=url_report["og:title"])
  changed = True

  if "description" in url_report.keys():
- urls_manager.set_url_description(url=url["url"], value=url_report["description"])
+ urls_manager.set_url_description(
+ url=url["url"], value=url_report["description"]
+ )
  changed = True
  elif "og:description" in url_report.keys():
- urls_manager.set_url_description(url=url["url"], value=url_report["og:description"])
+ urls_manager.set_url_description(
+ url=url["url"], value=url_report["og:description"]
+ )
  changed = True
  if not changed:
  urls_manager.set_url_error(url=url["url"], value="error: no h1 or description")


- def process_linkedin_post(url_report, url):
- print("linkedin_post or generic")
- print(url["url"])
+ def process_linkedin_post(url_report, url, verbose=False):
+ if verbose:
+ print("linkedin_post or generic")
+ print(url["url"])
  changed = False
  if "h1" in url_report.keys():
- print(url["url"], ": ", url_report["h1"])
+ if verbose:
+ print(url["url"], ": ", url_report["h1"])
  urls_manager.set_url_h1(url=url["url"], value=url_report["h1"])
  changed = True
  elif "og:title" in url_report.keys():
@@ -88,52 +103,50 @@ def process_linkedin_post(url_report, url):

  if description is not None:
  urls_manager.set_url_description(url=url["url"], value=description)
- description_links = load_txt.put_urls_from_string(text_to_process=description, parent_url=url["url"])
+ description_links = load_txt.put_urls_from_string(
+ text_to_process=description, parent_url=url["url"]
+ )
  urls_manager.set_url_description_links(url=url["url"], value=description_links)

  if not changed:
  urls_manager.set_url_error(url=url["url"], value="error: no h1 or description")


- def scrap_url(url):
- # TODO: Use get_urls_valid_prefix_by_id()
- df = urls_manager.get_urls_valid_prefix()
-
+ def scrap_url(url, verbose=False):
  # TODO: Need to change this

  if url["url_type"] is None:
- print("\n\ngeneric:", url["url"])
+ if verbose:
+ print("\n\ngeneric:", url["url"])
  url["url_type"] = "generic"
  else:
- print("\n\n", url["url_type"] + ":", url["url"])
+ if verbose:
+ print("\n\n", url["url_type"] + ":", url["url"])
  try:
  url_report = sniff_url.get_tags(url=url["url"])
  except Exception as e:
  urls_manager.set_url_error(url=url["url"], value="error")
  urls_manager.touch_url(url=url["url"])
- print("\n\n!!! ERROR FOR:", url["url"])
- print(
- "\n\n!!! you can check the URL using the command sniff-url",
- url["url"],
- "\n\n",
- )
+ if verbose:
+ print("\n\n!!! ERROR FOR:", url["url"])
+ print(
+ "\n\n!!! you can check the URL using the command sniff-url",
+ url["url"],
+ "\n\n",
+ )
  return

- # linkedin_redirect - linkedin (https://lnkd.in/)
  if url["url_type"] == "linkedin_redirect":
- process_linkedin_redirect(url_report=url_report, url=url)
+ process_linkedin_redirect(url_report=url_report, url=url, verbose=verbose)

- # linkedin_feed - linkedin (https://%.linkedin.com/feed/)
  if url["url_type"] == "linkedin_feed":
- process_linkedin_feed(url_report=url_report, url=url)
+ process_linkedin_feed(url_report=url_report, url=url, verbose=verbose)

- # linkedin_job - linkedin (https://www.linkedin.com/jobs/)
  if url["url_type"] == "linkedin_job":
- process_linkedin_job(url_report=url_report, url=url)
+ process_linkedin_job(url_report=url_report, url=url, verbose=verbose)

- # linkedin_job - linkedin (https://www.linkedin.com/jobs/)
  if url["url_type"] == "linkedin_post" or url["url_type"] == "generic":
- process_linkedin_post(url_report=url_report, url=url)
+ process_linkedin_post(url_report=url_report, url=url, verbose=verbose)

  urls_manager.set_url_json(url=url["url"], value=url_report["json"])
  urls_manager.touch_url(url=url["url"])
@@ -144,35 +157,53 @@ def isNaN(num):


  def scrap_urls(
- recursive=False, ignore_valid_prefix=False, randomize=False, only_parents=True
+ recursive=False,
+ ignore_valid_prefix=False,
+ randomize=False,
+ only_parents=True,
+ verbose=False,
+ n_urls=0,
  ):
+ limit = 10
  classify_urls.classify_urls()
  urls = urls_manager.get_untouched_urls(
  ignore_valid_prefix=ignore_valid_prefix,
  randomize=randomize,
  only_parents=only_parents,
+ limit=limit,
  )
  if len(urls) == 0:
- print("no urls to scrap")
+ print("📭 no urls to scrap")
+ if n_urls > 0:
+ print(f"-- 🗃️ {n_urls} scraped urls in total...")
+ print("scrapping is over...")
  return
  for index, url in urls.iterrows():
- scrap_url(url)
-
- wait = random.randint(15, 20)
  wait = random.randint(1, 3)
- print("sleeping for", wait, "seconds")
+ print(
+ "🐶 Scrapper is sleeping for", wait, "seconds before scraping next url..."
+ )
  time.sleep(wait)

+ print("🐕 Scrapper is sniffing the url...")
+ scrap_url(url=url, verbose=verbose)
+
+ n_urls = n_urls + len(urls)
+ print(f"-- 🗃️ {n_urls} scraped urls...")
  classify_urls.classify_urls()
  if recursive:
  wait = random.randint(5, 10)
- print("sleeping for", wait, "seconds before next round")
+ print(
+ f"🐶 Scrapper is sleeping for {wait} seconds before next round of {limit} urls"
+ )
  time.sleep(wait)
  scrap_urls(
  recursive=recursive,
  ignore_valid_prefix=ignore_valid_prefix,
  randomize=randomize,
  only_parents=only_parents,
+ verbose=verbose,
+ n_urls=n_urls,
  )
  else:
- print("ending...")
+ print("scrapping is over...")
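The reworked `scrap_urls` above now works in rounds of `limit = 10` untouched urls, sleeping between items and between rounds and threading the running `n_urls` count through the recursion. The recursive mode reduced to a schematic (function names here are illustrative):

```python
import random
import time

def scrape_in_batches(fetch_batch, scrape_one, batch_size=10, scraped=0):
    # One round: pull a batch, pause briefly before each item, scrape it,
    # then pause longer and recurse until the queue comes back empty.
    batch = fetch_batch(batch_size)
    if not batch:
        print(f"📭 no urls to scrap -- {scraped} scraped in total")
        return scraped
    for item in batch:
        time.sleep(random.randint(1, 3))
        scrape_one(item)
    time.sleep(random.randint(5, 10))
    return scrape_in_batches(fetch_batch, scrape_one, batch_size, scraped + len(batch))
```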
@@ -3,5 +3,5 @@ import ohmyscrapper.models.urls_manager as urls_manager

  def seed():
  urls_manager.seeds()
- print("db seeded")
+ print("🫒 db seeded")
  return
@@ -1,10 +1,13 @@
  import ohmyscrapper.models.urls_manager as urls_manager
  import math
+ import os
  from rich.console import Console
  from rich.table import Table


  def export_urls(limit=0, csv_file="output/urls.csv", simplify=False):
+ if not os.path.exists("output"):
+ os.mkdir("output")
  df = urls_manager.get_urls(limit=limit)

  if simplify:
@@ -12,7 +15,7 @@ def export_urls(limit=0, csv_file="output/urls.csv", simplify=False):

  df.to_csv(csv_file, index=False)
  print("--------------------")
- print("Urls exported to", csv_file)
+ print("📊🖋️ Urls exported to", csv_file)

  df.replace(
  {
@@ -22,17 +25,19 @@ def export_urls(limit=0, csv_file="output/urls.csv", simplify=False):
  inplace=True,
  )
  df.to_html(csv_file + "-preview.html", index=False)
- print("Urls preview exported to", csv_file + "-preview.html")
+ print("📜🖋️ Urls preview exported to", csv_file + "-preview.html")
  print("--------------------")


  def export_report(csv_file="output/report.csv"):
+ if not os.path.exists("output"):
+ os.mkdir("output")
  df = urls_manager.get_urls_report()

  df.to_csv(csv_file, index=False)
  _clear_file(csv_file)
  print("--------------------")
- print("Urls report exported to", csv_file)
+ print("📊🖋️ Urls report exported to", csv_file)

  df.replace(
  {
@@ -44,9 +49,10 @@ def export_report(csv_file="output/report.csv"):
  df.to_html(csv_file + "-preview.html", index=False)
  _clear_file(csv_file + "-preview.html")

- print("Urls report preview exported to", csv_file + "-preview.html")
+ print("📜🖋️ Urls report preview exported to", csv_file + "-preview.html")
  print("--------------------")

+
  # TODO: Add transformation layer
  def _clear_file(txt_tile):
  with open(txt_tile, "r") as f:
@@ -56,6 +62,7 @@ def _clear_file(txt_tile):
  with open(txt_tile, "w") as f:
  f.write(content)

+
  def show_urls(limit=0, jump_to_page=0):
  df = urls_manager.get_urls(limit=limit)
  df.drop(columns=["json", "description"], inplace=True)
@@ -3,5 +3,5 @@ import ohmyscrapper.models.urls_manager as urls_manager

  def untouch_all():
  urls_manager.untouch_all_urls()
- print("urls have been untouched")
+ print("🙌 urls have been untouched")
  return
@@ -1,23 +0,0 @@
- import ohmyscrapper.models.urls_manager as urls_manager
- import pandas as pd
- import time
-
-
- def classify_urls(recursive=False):
- urls_manager.seeds()
- df = urls_manager.get_urls_valid_prefix()
-
- keep_alive = True
- while keep_alive:
- print("waking up!")
- for index, row_prefix in df.iterrows():
- df_urls = urls_manager.get_url_like_unclassified(like_condition=row_prefix["url_prefix"])
- for index, row_urls in df_urls.iterrows():
- urls_manager.set_url_type_by_id(url_id =row_urls["id"], url_type=row_prefix["url_type"])
-
- if not recursive:
- print("ending...")
- keep_alive = False
- else:
- print("sleeping...")
- time.sleep(10)
@@ -1,32 +0,0 @@
- import os
- from urlextract import URLExtract
- import ohmyscrapper.models.urls_manager as urls_manager
-
-
- def load_txt(file_name="input/_chat.txt"):
-
- if not os.path.exists("input"):
- os.mkdir("input")
-
- urls_manager.create_tables()
- urls_manager.seeds()
- # make it recursive for all files
- text_file_content = open(file_name, "r").read()
-
- put_urls_from_string(text_to_process=text_file_content)
-
- # move_it_to_processed
- print("--------------------")
- print(file_name, "processed")
-
-
- def put_urls_from_string(text_to_process, parent_url=None):
- if isinstance(text_to_process, str):
- extractor = URLExtract()
- for url in extractor.find_urls(text_to_process):
- urls_manager.add_url(url=url, parent_url=parent_url)
- print(url, "added")
-
- return len(extractor.find_urls(text_to_process))
- else:
- return 0