PyPI - warn-scraper - Versions diffs - 1.2.88__py3-none-any.whl → 1.2.89__py3-none-any.whl - Mend

warn-scraper 1.2.88__py3-none-any.whl → 1.2.89__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

warn/cache.py CHANGED

@@ -69,7 +69,7 @@ class Cache:
         """
         path = Path(self.path, name)
         logger.debug(f"Reading CSV from cache {path}")
-        with open(path) as fh:
+        with open(path, encoding="utf-8") as fh:
             return list(csv.reader(fh))
     def download(
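
The hunk above pins the cache reader to UTF-8 instead of leaving the choice to `open()`'s platform-dependent default. A minimal sketch of the round trip follows; the helper name `read_csv_utf8` and the sample row are hypothetical, not part of warn-scraper, and `newline=""` is added here because the `csv` module documentation recommends it.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_utf8(path: Path) -> list:
    """Read a CSV file as a list of rows, decoding explicitly as UTF-8."""
    with open(path, encoding="utf-8", newline="") as fh:
        return list(csv.reader(fh))


# Write one row containing non-ASCII text, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "sample.csv")
    with open(sample, "w", encoding="utf-8", newline="") as fh:
        csv.writer(fh).writerow(["Peña Café", "2024-12-30"])
    rows = read_csv_utf8(sample)
```

Stating the encoding on both the write and the read side keeps the result the same regardless of the machine's locale settings.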

warn/scrapers/va.py CHANGED

@@ -1,13 +1,20 @@
+import datetime
 import logging
+import os
+from glob import glob
 from pathlib import Path
+from shutil import copyfile
+from time import sleep
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options as ChromeOptions
+from selenium.webdriver.chrome.service import Service as ChromeService
+from webdriver_manager.chrome import ChromeDriverManager
 from .. import utils
 from ..cache import Cache
-# from bs4 import BeautifulSoup, Tag
-__authors__ = ["zstumgoren", "Dilcia19", "shallotly"]
+__authors__ = ["zstumgoren", "Dilcia19", "shallotly", "stucka"]
 __tags__ = ["html", "csv"]
 __source__ = {
     "name": "Virginia Employment Commission",
@@ -30,37 +37,127 @@ def scrape(
     Returns: the Path where the file is written
     """
-    # This scraper initially tried to get a CSV download link that was only for the most recent entries. The scraping part of that broke.
-    # It's now hard-coded to a particular download link with parameters that should get the full thing.
+    cache = Cache(cache_dir)
+    csv_url = "https://vec.virginia.gov/warn-notices-csv.csv"
+    """
+    This scraper originally tried to parse HTML to find a CSV download link.
+    The HTML scraping portion broke in early December 2024. The code had
+    also been downloading an incomplete slice of the data.
+    In late December 2024, everything broke because Virginia decided to begin
+    testing for Javascript-aware browsers. This code is the way it is because
+    every alternative considered was somehow worse. Not helping? Losing about
+    four hours of work including the extensive documentation on the
+    alternatives sought.
+    Virginia's protections require a JS-aware browser to evaluate some
+    obscured, frequently changing code to set some short-lived cookies.
+    Without those cookies, no code. And even headless browsers get blocked
+    by a video test. Really unfun. So a ... headed? ... JS-aware browser
+    is required.
+    Some things evaluated included, off memory:
+    -- Using Playwright instead. This looked like a reasonable approach but
+        was awful resource-wise. Playwright itself had significant overhead,
+        partially from requiring its own version of browsers to be installed.
+        There's apparently some way with YAML to try to get Github Actions,
+        where this project is in production, to install only for particular
+        branches. Without that code, this'd be pending a couple minutes
+        several times a day on each of about 40 different branches of code.
+    -- Using Selenium. This is where it ultimately landed. It's not great,
+        but after trying about a dozen alternatives it's the best we got.
+    -- Installation code for Chrome's driver started acting flaky between
+        platforms.
+    -- PhantomJS couldn't even get past the first brush with the protection.
+    -- The optimal file is the CSV created by the state with well-defined
+        fields. Unfortunately, hitting the link once approved by the
+        Javascript results in an immediate download. There's no regular way
+        to get the file path through Javascript. Backdoor efforts like trying
+        to go through the Download menu also failed, because Chrome puts
+        them into a Shadow DOM. Several hunks of code to try to access the
+        Shadow DOM and get at the local filename are no longer functional
+        in Chrome. Building an extension to track some of this ... is not
+        an option, and loading it the first time would require human
+        intervention rather than automation. There might be a way to mess
+        with the Shadow DOM through CSS manipulation, but that looked too
+        weird to bother trying especially given other more reasonable measures
+        that no longer worked.
+    -- Also, efforts to get at the CSV through view-source failed.
+    -- And it's possible to scrape the HTML and try to parse it back out for
+        what warn-scraper needs, but that seemed even more fraught than trying
+        to get the CSV.
+    -- So if the filename isn't obtainable through Chrome, where do we get it?
+        There's a multiplatform way to get at a user's home directory. For
+        many people Downloads is off there, at ... ~/Downloads, capital D,
+        plural. Except people can configure that differently. And most
+        languages won't call it Downloads. And Chrome of course lets people
+        set a default download location that can be anywhere else, or select
+        a per-file location ("Ask me where to save this" or some such).
+        After going down even more rabbit holes, ... ~/Downloads is all that
+        gets implemented here.
+    -- I tried to see if Firefox might be a little less grumpy. One Python
+        driver-finder got one day of commits. A fork has Issues turned off
+        somehow. The third one I looked at was the one that was grumpy for
+        Chrome, and its maintainer is apparently trying to protect his
+        homeland with FPV drones. So ... back to Chrome.
+    So, yes, this is a weird implementation. It's a terrible model. It's
+    even got a hard-coded wait. At least as of late December 2024, however,
+    it does work. ... in late December 2024.
+    """
-    # This may break again, but this revised attempt has far fewer moving parts and actually fetches the complete data set.
-    # Blame Stucka in December 2024.
+    # driver = webdriver.Chrome(options=chromeoptionsholder, service=Service(ChromeDriverManager().install()))
+    logger.debug("Attempting to launch Chrome")
+    chromeoptionsholder = ChromeOptions()
+    chrome_install = ChromeDriverManager().install()
-    # Get the WARN page
-    # url = "https://www.vec.virginia.gov/warn-notices"
-    # url = "https://vec.virginia.gov/warn-notices?field_notice_date_value%5Bmin%5D%5Bdate%5D=1%2F1%2F1990&field_notice_date_value%5Bmax%5D%5Bdate%5D=&field_region_warn_tid=All"
-    # r = utils.get_url(url, verify=True)
-    # html = r.text
+    # Weird error with finding the driver name in Windows. Sometimes.
+    if chrome_install.endswith("THIRD_PARTY_NOTICES.chromedriver"):
+        chrome_install = chrome_install.replace(
+            "THIRD_PARTY_NOTICES.chromedriver", "chromedriver.exe"
+        )
+    logger.debug(f"Chrome install variable is {chrome_install}")
+    # folder = os.path.dirname(chrome_install)
+    # chromedriver_path = folder #  os.path.join(folder, "chromedriver.exe")
+    # service = ChromeService(chromedriver_path)
+    service = ChromeService(chrome_install)
+    driver = webdriver.Chrome(options=chromeoptionsholder, service=service)
+    logger.debug(f"Attempting to fetch {csv_url}")
+    driver.get(csv_url)
-    # Save it to the cache
-    cache = Cache(cache_dir)
-    # cache.write("va/source.html", html)
+    sleep(30)  # Give it plenty of time to evaluate Javascript
+    download_dir = os.path.expanduser("~") + "/Downloads"
+    if not os.path.isdir(download_dir):
+        logger.error(f"The download directory is not {download_dir}.")
+    # get the list of files
+    list_of_files = glob(download_dir + "/warn-notices-csv*.csv")
+    if len(list_of_files) == 0:
+        logger.error(f"No matching files found in {download_dir}.")
+    # get the latest file name
+    latest_file = max(list_of_files, key=os.path.getctime)
+    latest_file_time = datetime.datetime.fromtimestamp(os.path.getctime(latest_file))
+    # print the latest file name
+    logger.debug(f"CSV saved to {latest_file}, saved at {latest_file_time}")
+    target_filename = cache_dir / "va" / "source.csv"
+    utils.create_directory(path=cache_dir / "va", is_file=False)
-    # Parse out the CSV download link
-    # soup = BeautifulSoup(html, "html.parser")
-    # csv_link = soup.find("a", text="Download")
-    # if isinstance(csv_link, Tag):
-    #     csv_href = csv_link["href"]
-    # else:
-    #     raise ValueError("Could not find CSV link")
+    logger.debug(f"Saving file to {target_filename}")
-    # csv_href = "/warn-notices-csv.csv?"
-    # csv_url = f"https://www.vec.virginia.gov{csv_href}"
+    copyfile(latest_file, target_filename)
-    csv_url = "https://vec.virginia.gov/warn-notices-csv.csv?field_notice_date_value%5Bmin%5D%5Bdate%5D=1%2F1%2F1990&field_notice_date_value%5Bmax%5D%5Bdate%5D=&field_region_warn_tid=All"
+    driver.quit()
     # Download it to the cache
-    cache.download("va/source.csv", csv_url, verify=True)
+    # cache.download("va/source.csv", csv_url, verify=True)
     # Open it up as a list of rows
     csv_rows = cache.read_csv("va/source.csv")
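
The scraper code above waits a fixed 30 seconds and then takes the newest matching file from `~/Downloads`. As written, an empty match list is logged but still passed to `max()`, which raises `ValueError` on an empty sequence. A minimal sketch of one alternative, polling until a match appears and failing with a clear error otherwise, is shown below. The function name `wait_for_download` and its parameters are hypothetical; only the glob pattern comes from the diff, and the sketch ranks files by modification time where the released code uses `os.path.getctime`.

```python
import time
from pathlib import Path


def wait_for_download(
    download_dir: Path,
    pattern: str = "warn-notices-csv*.csv",
    timeout: float = 30.0,
    poll_interval: float = 0.5,
) -> Path:
    """Poll download_dir until a file matches pattern, then return the newest.

    Raises FileNotFoundError if the directory is missing or if nothing
    matches before the timeout expires.
    """
    if not download_dir.is_dir():
        raise FileNotFoundError(f"Download directory not found: {download_dir}")

    deadline = time.monotonic() + timeout
    while True:
        candidates = list(download_dir.glob(pattern))
        if candidates:
            # Newest by modification time (an assumption of this sketch).
            return max(candidates, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise FileNotFoundError(
                f"No files matching {pattern!r} found in {download_dir}"
            )
        time.sleep(poll_interval)
```

A call such as `wait_for_download(Path.home() / "Downloads")` would stand in for the `sleep(30)`, `glob`, and `max` steps. It still assumes the browser saves to `~/Downloads`, the same limitation the docstring in the diff describes.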

warn/utils.py CHANGED

@@ -86,7 +86,7 @@ def save_if_good_url(filename, url, **kwargs):
         success_flag = False
         content = False
     else:
-        with open(filename, "wb") as outfile:
+        with open(filename, "wb", encoding="utf-8") as outfile:
             outfile.write(response.content)
             success_flag = True
             content = response.content
@@ -104,7 +104,7 @@ def write_rows_to_csv(output_path: Path, rows: list, mode="w"):
     """
     create_directory(output_path, is_file=True)
     logger.debug(f"Writing {len(rows)} rows to {output_path}")
-    with open(output_path, mode, newline="") as f:
+    with open(output_path, mode, newline="", encoding="utf-8") as f:
         writer = csv.writer(f)
         writer.writerows(rows)
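
One detail in the first `warn/utils.py` hunk is worth checking before relying on `save_if_good_url`: Python's built-in `open()` does not accept an `encoding` argument in binary mode, so `open(filename, "wb", encoding="utf-8")` raises `ValueError` when called. The short demonstration below uses a temporary file as a stand-in for the real download target; the file name and payload are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "payload.bin")

    # Binary mode combined with an encoding is rejected by open().
    try:
        with open(target, "wb", encoding="utf-8") as outfile:
            outfile.write(b"data")
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

    # Raw bytes (such as response.content) need no encoding; plain "wb" works.
    with open(target, "wb") as outfile:
        outfile.write(b"data")
    written = target.read_bytes()
```

The text-mode change in `write_rows_to_csv` is unaffected, since `encoding` is valid whenever the mode is not binary.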

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: warn-scraper
-Version: 1.2.88
+Version: 1.2.89
 Summary: Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
 Home-page: https://github.com/biglocalnews/warn-scraper
 Author: Big Local News
@@ -28,9 +28,11 @@ Requires-Dist: html5lib
 Requires-Dist: pdfplumber
 Requires-Dist: requests
 Requires-Dist: openpyxl
-Requires-Dist: xlrd
-Requires-Dist: tenacity
 Requires-Dist: retry
+Requires-Dist: selenium
+Requires-Dist: tenacity
+Requires-Dist: xlrd
+Requires-Dist: webdriver-manager
 ## Links

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/RECORD RENAMED

@@ -14,10 +14,10 @@ tests/cassettes/test_scrape_integration.yaml,sha256=5JfS-nscabP0rDhUeBIXMIFVSS5q
 tests/fixtures/2021_page_1.html,sha256=ZIPBhPE2BTcJX-mm5_4M4pgQcnrQWDBuGrEJDonj2QE,34
 tests/fixtures/2021_page_2.html,sha256=qm6lX8LwFRNT2WIIW2U29ku3wEGzvECzQJCWBtcwSbg,34
 warn/__init__.py,sha256=A07JFY1TyaPtVIndBa7IvTk13DETqIkLgRdk0A-MCoE,85
-warn/cache.py,sha256=K6K27LfHqZ2m4_fU18LjVskYsqw9aojcychgUXUkOLU,5502
+warn/cache.py,sha256=hyta04_G-ALGwcKl4xNc7EgHS_xklyVD5d8SXNrJekY,5520
 warn/cli.py,sha256=ZqyJwICdHFkn2hEgbArj_upbElR9-TSDlYDqyEGeexE,2019
 warn/runner.py,sha256=oeGRybGwpnkQKlPzRMlKxhsDt1GN4PZoX-vUwrsPgos,1894
-warn/utils.py,sha256=eoY-6VhYLTx22hTlcTZiKGMy8OH7BVkUS4rAmVEUsEA,7026
+warn/utils.py,sha256=V1JQD-bPwNiZ8kpl_YsonfjtaF1a8M8jlBNbdwGXcq4,7062
 warn/platforms/__init__.py,sha256=wIZRDf4tbTuC8oKM4ZrTAtwNgbtMQGzPXMwDYCFyrog,81
 warn/platforms/job_center/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 warn/platforms/job_center/cache.py,sha256=yhA3sE46lNFg8vEewSoRYVByi0YSlkBiKm7qoSUiTdM,1868
@@ -61,13 +61,13 @@ warn/scrapers/sd.py,sha256=_4R19Ybzsyx1PvcWV3_laJmJ3etrwVGfhNEQm6njwoA,1904
 warn/scrapers/tn.py,sha256=i1H7c09Ea3CDrTXqqRMLBMPT_34QtGA0-x7T8rm_j5Q,2945
 warn/scrapers/tx.py,sha256=watfR1gyN9w7nluiAOnnIghEmoq3eShNUzYSZ8SkZy4,4438
 warn/scrapers/ut.py,sha256=iUh38YIjbvv5MyyKacsiZNe8KjfdBeDaOf-qMQEF_kc,2245
-warn/scrapers/va.py,sha256=-QRIMPVIhBGDiKQOaMwwZbPtJxd1S2QwYX4Zxq1NNt0,2549
+warn/scrapers/va.py,sha256=jp5G9Z73s5j9A3-1IybFV0rmZSBWH73vNrQpC7XLSSU,7573
 warn/scrapers/vt.py,sha256=d-bo4WK2hkrk4BhCCmLpEovcoZltlvdIUB6O0uaMx5A,1186
 warn/scrapers/wa.py,sha256=UXdVtHZo_a-XfoiyOooTRfTb9W3PErSZdKca6SRORgs,4282
 warn/scrapers/wi.py,sha256=ClEzXkwZbop0W4fkQgsb5oHAPUrb4luUPGV-jOKwkcg,4855
-warn_scraper-1.2.88.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
-warn_scraper-1.2.88.dist-info/METADATA,sha256=-30KqUCkeCjeAa13tyECWG9PCHBxQhSPdbbhQ1rZGF4,2036
-warn_scraper-1.2.88.dist-info/WHEEL,sha256=Wyh-_nZ0DJYolHNn1_hMa4lM7uDedD_RGVwbmTjyItk,91
-warn_scraper-1.2.88.dist-info/entry_points.txt,sha256=poh_oSweObGlBSs1_2qZmnTodlOYD0KfO7-h7W2UQIw,47
-warn_scraper-1.2.88.dist-info/top_level.txt,sha256=gOhHgNEkrUvajlzoKkVOo-TlQht9MoXnKOErjzqLGHo,11
-warn_scraper-1.2.88.dist-info/RECORD,,
+warn_scraper-1.2.89.dist-info/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+warn_scraper-1.2.89.dist-info/METADATA,sha256=5iDNnLM7c0Z6kpCF4-QB-6dbZWCkmPUnEw7FG6npfPo,2093
+warn_scraper-1.2.89.dist-info/WHEEL,sha256=Wyh-_nZ0DJYolHNn1_hMa4lM7uDedD_RGVwbmTjyItk,91
+warn_scraper-1.2.89.dist-info/entry_points.txt,sha256=poh_oSweObGlBSs1_2qZmnTodlOYD0KfO7-h7W2UQIw,47
+warn_scraper-1.2.89.dist-info/top_level.txt,sha256=gOhHgNEkrUvajlzoKkVOo-TlQht9MoXnKOErjzqLGHo,11
+warn_scraper-1.2.89.dist-info/RECORD,,

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/LICENSE RENAMED

File without changes

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/WHEEL RENAMED

File without changes

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/entry_points.txt RENAMED

File without changes

{warn_scraper-1.2.88.dist-info → warn_scraper-1.2.89.dist-info}/top_level.txt RENAMED

File without changes
