warn-scraper 1.2.58.tar.gz → 1.2.60.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {warn-scraper-1.2.58/warn_scraper.egg-info → warn-scraper-1.2.60}/PKG-INFO +1 -1
- warn-scraper-1.2.60/warn/scrapers/hi.py +133 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/md.py +2 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60/warn_scraper.egg-info}/PKG-INFO +1 -1
- warn-scraper-1.2.58/warn/scrapers/hi.py +0 -98
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/.devcontainer/devcontainer.json +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/.github/dependabot.yml.disabled +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/.github/workflows/continuous-deployment.yml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/.gitignore +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/.pre-commit-config.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/LICENSE +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/MANIFEST.in +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/Makefile +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/Pipfile +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/Pipfile.lock +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/README.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/Makefile +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/R42693.pdf +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/gao-03-1003.pdf +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-actions-finished.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-actions-start.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-changelog-button.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-changelog-entered.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-draft-button.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-name-release.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-name-tag.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-publish-button.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-pypi.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-release-published.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-releases-button.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_static/releasing-tag-button.png +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/_templates/sources.md.tmpl +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/conf.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/contributing.rst +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/index.rst +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/make.bat +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/reference.rst +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/releasing.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/requirements.txt +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/al.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/az.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ca.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/co.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/dc.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/de.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ia.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/in.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/job_center.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ks.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/md.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/me.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/mo.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ny.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ok.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/or.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/sc.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/tx.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/ut.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/va.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/vt.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/scrapers/wi.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/sources.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/docs/usage.md +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/setup.cfg +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/setup.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/__init__.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_cached_detail_pages.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_cached_search_results.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_delete.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_missing_detail_page_values.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_no_results.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_paged_results.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/cassettes/test_scrape_integration.yaml +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/conftest.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/fixtures/2021_page_1.html +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/fixtures/2021_page_2.html +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/test_cache.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/test_delete.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/test_job_center.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/test_job_center_cache.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/tests/test_openpyxl.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/__init__.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/cache.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/cli.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/__init__.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/job_center/__init__.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/job_center/cache.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/job_center/site.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/job_center/urls.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/platforms/job_center/utils.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/runner.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/__init__.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ak.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/al.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/az.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ca.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/co.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ct.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/dc.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/de.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/fl.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ga.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ia.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/id.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/il.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/in.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ks.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ky.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/la.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/me.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/mi.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/mo.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/mt.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ne.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/nj.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/nm.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ny.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/oh.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ok.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/or.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ri.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/sc.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/sd.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/tn.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/tx.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/ut.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/va.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/vt.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/wa.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/scrapers/wi.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn/utils.py +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/SOURCES.txt +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/dependency_links.txt +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/entry_points.txt +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/not-zip-safe +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/requires.txt +0 -0
- {warn-scraper-1.2.58 → warn-scraper-1.2.60}/warn_scraper.egg-info/top_level.txt +0 -0
{warn-scraper-1.2.58/warn_scraper.egg-info → warn-scraper-1.2.60}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: warn-scraper
-Version: 1.2.58
+Version: 1.2.60
 Summary: Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
 Home-page: https://github.com/biglocalnews/warn-scraper
 Author: Big Local News
warn-scraper-1.2.60/warn/scrapers/hi.py (added)
@@ -0,0 +1,133 @@
+import datetime
+import logging
+from pathlib import Path
+from time import sleep
+from urllib.parse import quote
+
+from bs4 import BeautifulSoup
+
+from .. import utils
+
+__authors__ = ["Ash1R", "stucka"]
+__tags__ = ["html", "pdf"]
+__source__ = {
+    "name": "Workforce Development Hawaii",
+    "url": "https://labor.hawaii.gov/wdc/real-time-warn-updates/",
+}
+
+logger = logging.getLogger(__name__)
+
+
+def scrape(
+    data_dir: Path = utils.WARN_DATA_DIR,
+    cache_dir: Path = utils.WARN_CACHE_DIR,
+) -> Path:
+    """
+    Scrape data from Hawaii.
+
+    Keyword arguments:
+    data_dir -- the Path were the result will be saved (default WARN_DATA_DIR)
+    cache_dir -- the Path where results can be cached (default WARN_CACHE_DIR)
+    Returns: the Path where the file is written
+    """
+    # Google Cache is a backup if the state re-implements its JS-enabled browser equivalent
+    usegooglecache = False
+    cacheprefix = "https://webcache.googleusercontent.com/search?q=cache%3A"
+
+    firstpageurl = "https://labor.hawaii.gov/wdc/real-time-warn-updates/"
+    if usegooglecache:
+        firstpageurl = cacheprefix + quote(firstpageurl)
+
+    firstpage = utils.get_url(firstpageurl)
+    soup = BeautifulSoup(firstpage.text, features="html5lib")
+    pagesection = soup.select("div.primary-content")[0]
+    subpageurls = []
+    for atag in pagesection.find_all("a"):
+        href = atag["href"]
+        if href.endswith("/"):
+            href = href  # [:-1]
+        subpageurl = href
+        if usegooglecache:
+            subpageurl = cacheprefix + quote(subpageurl)
+        subpageurls.append(subpageurl)
+
+    masterlist = []
+    headers = ["Company", "Date", "PDF url", "location", "jobs"]
+    # data = [headers]
+    # lastdateseen = "2099-12-31"
+
+    for subpageurl in reversed(subpageurls):
+        sleep(2)
+        # Conditionally here, we want to check and see if we have the old cached files, or if the year is current or previous.
+        # Only need to download if it's current or previous year.
+        # But do we care enough to implement right now?
+
+        logger.debug(f"Parsing page {subpageurl}")
+        page = utils.get_url(subpageurl)
+        soup = BeautifulSoup(page.text, features="html5lib")
+        if subpageurl.endswith("/"):
+            subpageurl = subpageurl[:-1]  # Trim off the final slash, if there is one
+        pageyear = subpageurl.split("/")[-1][:4]
+
+        # There are at least two formats for Hawaii. In some years, each individual layoff is in a paragraph tag.
+        # In others, all the layoffs are grouped under a single paragraph tag, separated by <br>
+        # BeautifulSoup converts that to a <br/>.
+        # But the call to parent also repeats a bunch of entries, so we need to ensure they're not.
+        # So in more recent years, finding the parent of the "p a" there find essentially the row of data.
+        # In the older years, the parent is ... all the rows of data, which gets repeated.
+        # So take each chunk of data, find the parent, do some quality checks, clean up the text,
+        # don't engage with duplicates.
+
+        selection = soup.select("p a[href*=pdf]")
+        rows = []
+        for child in selection:
+            parent = child.parent
+            for subitem in parent.prettify().split("<br/>"):
+                if len(subitem.strip()) > 5 and ".pdf" in subitem:
+                    subitem = subitem.replace("\xa0", " ").replace("\n", "").strip()
+                    row = BeautifulSoup(subitem, features="html5lib")
+                    if row not in rows:
+                        rows.append(row)
+
+        for row in rows:
+            line: dict = {}
+            for item in headers:
+                line[item] = None
+            graftext = row.get_text().strip()
+            tempdate = graftext
+
+            # Check to see if it's not an amendment, doesn't have 3/17/2022 date format
+            # Most dates should be like "March 17, 2022"
+            if pageyear in tempdate and f"/{pageyear}" not in tempdate:
+                try:
+                    tempdate = (
+                        graftext.strip().split(pageyear)[0].strip() + f" {pageyear}"
+                    )
+                except ValueError:
+                    print(f"Date conversion failed on row: {row}")
+
+            line["Date"] = tempdate
+
+            try:
+                parsed_date = datetime.datetime.strptime(
+                    tempdate, "%B %d, %Y"
+                ).strftime("%Y-%m-%d")
+                line["Date"] = parsed_date
+            except ValueError:
+                logger.debug(f"Date error: '{tempdate}', leaving intact")
+
+            line["PDF url"] = row.select("a")[0].get("href")
+            line["Company"] = row.select("a")[0].get_text().strip()
+            masterlist.append(line)
+
+    if len(masterlist) == 0:
+        logger.error(
+            "No data scraped -- anti-scraping mechanism may be back in play -- try Google Cache?"
+        )
+    output_csv = data_dir / "hi.csv"
+    utils.write_dict_rows_to_csv(output_csv, headers, masterlist)
+    return output_csv
+
+
+if __name__ == "__main__":
+    scrape()
{warn-scraper-1.2.58 → warn-scraper-1.2.60/warn_scraper.egg-info}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: warn-scraper
-Version: 1.2.58
+Version: 1.2.60
 Summary: Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
 Home-page: https://github.com/biglocalnews/warn-scraper
 Author: Big Local News
warn-scraper-1.2.58/warn/scrapers/hi.py (removed)
@@ -1,98 +0,0 @@
-import datetime
-import logging
-from pathlib import Path
-
-from bs4 import BeautifulSoup
-
-from .. import utils
-
-__authors__ = ["Ash1R", "stucka"]
-__tags__ = ["html"]
-__source__ = {
-    "name": "Workforce Development Hawaii",
-    "url": "https://labor.hawaii.gov/wdc/real-time-warn-updates/",
-}
-
-logger = logging.getLogger(__name__)
-
-
-def scrape(
-    data_dir: Path = utils.WARN_DATA_DIR,
-    cache_dir: Path = utils.WARN_CACHE_DIR,
-) -> Path:
-    """
-    Scrape data from Hawaii.
-
-    Keyword arguments:
-    data_dir -- the Path were the result will be saved (default WARN_DATA_DIR)
-    cache_dir -- the Path where results can be cached (default WARN_CACHE_DIR)
-    Returns: the Path where the file is written
-    """
-    firstpage = utils.get_url("https://labor.hawaii.gov/wdc/real-time-warn-updates/")
-    soup = BeautifulSoup(firstpage.text, features="html5lib")
-    pagesection = soup.select("div.primary-content")[0]
-    subpageurls = []
-    for atag in pagesection.find_all("a"):
-        href = atag["href"]
-        if href.endswith("/"):
-            href = href[:-1]
-        subpageurls.append(href)
-
-    headers = ["Company", "Date", "PDF url", "location", "jobs"]
-    data = [headers]
-    # lastdateseen = "2099-12-31"
-
-    for subpageurl in reversed(subpageurls):
-        # Conditionally here, we want to check and see if we have the old cached files, or if the year is current or previous.
-        # Only need to download if it's current or previous year.
-        # But do we care enough to implement right now?
-
-        logger.debug(f"Parsing page {subpageurl}")
-        page = utils.get_url(subpageurl)
-        soup = BeautifulSoup(page.text, features="html5lib")
-        pageyear = subpageurl.split("/")[-1][:4]
-        tags = soup.select("p a[href*=pdf]")
-        p_tags = [i.parent.get_text().replace("\xa0", " ").split("\n") for i in tags]
-        clean_p_tags = [j for i in p_tags for j in i]
-
-        dates = [k.split("–")[0].strip() for k in clean_p_tags]
-        for i in range(len(dates)):
-            try:
-                tempdate = dates[i].split(pageyear)[0].strip() + f" {pageyear}"
-                parsed_date = datetime.datetime.strptime(
-                    tempdate, "%B %d, %Y"
-                ).strftime("%Y-%m-%d")
-                dates[i] = parsed_date
-                # lastdateseen = parsed_date
-
-                # Disabling amendment automation to shift fixes into warn-transformer instead.
-                # If this needs to come back, uncomment the lastseendate references
-                # then rebuild the below section as an else
-            except ValueError:
-                logger.debug(f"Date error: {dates[i]}, leaving intact")
-                # if "*" in dates[i]:
-                #     logger.debug(
-                #         f"Date error: {dates[i]} as apparent amendment; saving as {lastdateseen}"
-                #     )
-                #     dates[i] = lastdateseen
-                # else:
-
-        for i in range(len(tags)):
-            row = []
-            url = tags[i].get("href")
-            row.append(tags[i].get_text())
-
-            row.append(dates[i])
-
-            row.append(url)
-            row.append(None)  # location
-            row.append(None)  # jobs
-            data.append(row)
-
-    output_csv = data_dir / "hi.csv"
-    utils.write_rows_to_csv(output_csv, data)
-    return output_csv
-
-
-if __name__ == "__main__":
-    scrape()