oddb2xml 3.0.28 → 3.0.29
- checksums.yaml +4 -4
- data/CLAUDE.md +1 -1
- data/Gemfile.lock +1 -1
- data/lib/oddb2xml/downloader.rb +41 -2
- data/lib/oddb2xml/version.rb +1 -1
- data/scripts/generate_index_html.sh +9 -2
- data/scripts/run_oddb2xml.sh +66 -5
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 4602599ed50439225ff6ce76000f8441a9b4963b02538adf3f46d6728beff6cc
|
|
4
|
+
data.tar.gz: 55f34b946cc7a0e75435459aa90169e23719ac44c4bbfdf1ba3d63c3162f894d
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: cf98f9144a53cd62be8599168423c4c6d560b8474dc015d5281eccb84b2f80de6d46e3765b7ab96cc6271822361ea0a4a7b1f00bae1908e16f94f8fb77006908
|
|
7
|
+
data.tar.gz: 7ed225d5f4cd6f3c3293913cb40e01430839968d34e15551a2192a1f9069343dd98f7448179a0e433bc6485d8810510904add62632e6e5ddf6d03d50b8f41dc8
|
data/CLAUDE.md
CHANGED
|
@@ -68,7 +68,7 @@ YAML files in `data/` provide manual overrides and mappings: `article_overrides.
|
|
|
68
68
|
|
|
69
69
|
These scripts run the public download server at `https://mediupdatexml.oddb.org` (Apache on this host) and are **not** part of the gem itself.
|
|
70
70
|
|
|
71
|
-
- **`run_oddb2xml.sh`** — nightly build driver (cron: `0 1 * * * zdavatz`). Downloads the upstream sources **once**, then builds the `-b`/firstbase feed at price increments `45/50/55` plus `default` (no increment) into `$OUT_DIR` (`/home/zdavatz/oddb2xml`, one subdir each). The shared `downloads/` cache and transient zip live in `$BUILD_DIR` (`<OUT_DIR>-build`), **outside** `$OUT_DIR` so the transfer never uploads the multi-hundred-MB cache. Final step ("2b") regenerates the landing page. Each `oddb2xml` invocation is wrapped in `run_with_retry` (default **3 attempts, 120 s apart**, tunable via `ODDB2XML_RETRIES`/`ODDB2XML_RETRY_DELAY`): a transient upstream download failure (e.g. Swissmedic resetting the connection, `Errno::ECONNRESET`) previously aborted the whole `set -e` run 14 s in and left the feeds a day stale, so it now retries before giving up; a genuine repeated failure still stops the run.
|
|
71
|
+
- **`run_oddb2xml.sh`** — nightly build driver (cron: `0 1 * * * zdavatz`). Downloads the upstream sources **once**, then builds the `-b`/firstbase feed at price increments `45/50/55` plus `default` (no increment) into `$OUT_DIR` (`/home/zdavatz/oddb2xml`, one subdir each). The shared `downloads/` cache and transient zip live in `$BUILD_DIR` (`<OUT_DIR>-build`), **outside** `$OUT_DIR` so the transfer never uploads the multi-hundred-MB cache. Final step ("2b") regenerates the landing page. Each `oddb2xml` invocation is wrapped in `run_with_retry` (default **3 attempts, 120 s apart**, tunable via `ODDB2XML_RETRIES`/`ODDB2XML_RETRY_DELAY`): a transient upstream download failure (e.g. Swissmedic resetting the connection, `Errno::ECONNRESET`) previously aborted the whole `set -e` run 14 s in and left the feeds a day stale, so it now retries before giving up; a genuine repeated failure still stops the run. **Firstbase (GS1 NONPHARMA) last-good fallback (3.0.29 onwards):** the GS1 `GetFirstbaseHealthcare` export (`id.gs1.ch/01/07612345000961` → `apitools.gs1.ch`) has been answering `403 - Forbidden`, which blanked `firstbase.csv` and dropped **every** NONPHARMA article from the `-b` feed (landing page then showed `NONPHARMA = 0 − 1 = −1`). The script keeps the last successful `firstbase.csv` in a persistent cache `$FIRSTBASE_CACHE` (default `<OUT_DIR>-state/firstbase.csv`) **outside** `$BUILD_DIR` so it survives the nightly `rm -rf`, seeds it into `downloads/` before the build, and refreshes it after a successful download. The gem side (`FirstbaseDownloader#download`, rewritten in 3.0.29) makes the seed usable: it still attempts the live GS1 fetch on the first (downloading) build, but only overwrites `firstbase.csv` when the response is a real non-empty CSV (`firstbase_csv?` rejects HTML/`403 - Forbidden`/empty bodies and open-uri exceptions), otherwise it **keeps the existing seeded file** instead of the old `"w+"` truncate-to-zero. 
A recovered GS1 therefore refreshes the data automatically; while GS1 is down the feed serves yesterday's (last-good) NONPHARMA rather than nothing. `generate_index_html.sh` also guards the NONPHARMA count so an empty CSV renders `—` (not `−1`).
|
|
72
72
|
- **`generate_index_html.sh DOCROOT [FIRSTBASE_CSV]`** — single source of truth for the landing page. Writes `index.html` + a self-contained `logo.svg` **atomically** (temp + `mv`, so either owner — root from setup, `zdavatz` from cron — can refresh it). Computes live counts: PHARMA = `<SMNO>` count in `default/oddb_article.xml`, NONPHARMA = firstbase CSV rows − 1, total ART = `<ART ` count. Also runs **`visitor_stats.py`** and embeds its graph. Re-run standalone any time (it only reads already-built files); a separate cron line refreshes it **hourly** (`5 * * * * zdavatz`) so counts + graph stay current between nightly builds.
|
|
73
73
|
- **`visitor_stats.py LOG_GLOB CACHE_DIR [DAYS]`** — emits the visitors/sessions/region graph as an inline-SVG HTML **fragment** (last `DAYS`, default 14): Besucher = distinct IPs/day, Sitzungen = 30-min-inactivity sessions per `(IP, User-Agent)`, plus a top-6 country breakdown by IP. Bots are filtered by User-Agent. Region lookup is **fully self-contained** — pure Python stdlib + the free **DB-IP country-lite CSV** (CC-BY, no licence key) cached in the build `downloads/` dir and refreshed monthly; **no apt package, no gem, no system GeoIP DB**. Prints nothing (page degrades to omitting the section) when the Apache log is unreadable or empty. Reading `/var/log/apache2` requires the cron user to be in the **`adm`** group (`sudo usermod -aG adm zdavatz`).
|
|
74
74
|
- **`swissmedic_watch.sh`** — outage/block auto-recovery (cron: `*/30 * * * * zdavatz`). Since the Swissmedic platform migration (~2026-06-23, now a Swisscom-operated gateway), `www.swissmedic.ch` intermittently resets this host's automated connections **after the TLS handshake** (TCP RST), which aborts `run_oddb2xml.sh` under `set -e` and leaves the feeds stale (the block is host/IP- and client-fingerprint-sensitive: a real browser works, `curl`/`wget`/Ruby get reset, while other admin.ch hosts answer fine — so it is a WAF/bot rule, not an outage). The watcher polls Swissmedic with **oddb2xml's own client** (a Ruby `open-uri` canary on `listen_neu.html`); while blocked it is a silent no-op, and the moment it gets HTTP 200 it launches **one** build and emails. It fires **at most once per day** (stamp in `$STATE_DIR`, default `<OUT_DIR>-watch`, kept **outside** the wiped `$BUILD_DIR`), and skips when a build is already running or today's `default/oddb_article.xml` is already fresh. It exports `RBENV_VERSION=3.4.5` + the rbenv-shims PATH to match the nightly cron (the repo `.ruby-version` pins an uninstalled Ruby).
|
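The last-good fallback described in the `run_oddb2xml.sh` entry above amounts to a seed, build, refresh cycle. The following is a minimal shell sketch of that cycle, not the script itself: the temporary paths and the `fake_build` function (a stand-in for the real `oddb2xml` run) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the seed -> build -> refresh cycle for firstbase.csv.
# All paths are temporary stand-ins; fake_build is hypothetical.
work=$(mktemp -d)
BUILD_DIR="$work/build"
FIRSTBASE_CACHE="$work/state/firstbase.csv"   # lives outside BUILD_DIR

seed_cache() {
  # Copy the last-good CSV into downloads/ when one exists.
  mkdir -p "$BUILD_DIR/downloads"
  if [[ -s "$FIRSTBASE_CACHE" ]]; then
    cp -p "$FIRSTBASE_CACHE" "$BUILD_DIR/downloads/firstbase.csv"
  fi
}

fake_build() {
  # $1 = "ok" simulates a valid GS1 answer; anything else simulates a 403,
  # in which case the seeded file is left untouched.
  if [[ "$1" == ok ]]; then
    printf 'gtin,name\n1,a\n' > "$BUILD_DIR/downloads/firstbase.csv"
  fi
}

refresh_cache() {
  # Only a non-empty file is worth keeping as the next fallback.
  local live="$BUILD_DIR/downloads/firstbase.csv"
  if [[ -s "$live" ]]; then
    mkdir -p "$(dirname "$FIRSTBASE_CACHE")"
    cp -p "$live" "$FIRSTBASE_CACHE"
  fi
}

nightly() {
  rm -rf "$BUILD_DIR"   # the nightly wipe that the cache must survive
  seed_cache
  fake_build "$1"
  refresh_cache
}

nightly ok     # night 1: GS1 answers, the cache is created
nightly fail   # night 2: GS1 fails, the build still sees night 1's rows
wc -l < "$BUILD_DIR/downloads/firstbase.csv"
```

Keeping the cache outside `$BUILD_DIR` is what lets the second run recover the first run's data after the wipe.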
data/Gemfile.lock
CHANGED
data/lib/oddb2xml/downloader.rb
CHANGED
|
@@ -396,14 +396,53 @@ module Oddb2xml
|
|
|
396
396
|
@url = BASE_URL
|
|
397
397
|
end
|
|
398
398
|
|
|
399
|
+
# A valid firstbase export is a non-empty CSV. When GS1 is unavailable it
|
|
400
|
+
# answers with an HTML error page (the GetFirstbaseHealthcare endpoint has
|
|
401
|
+
# been returning "403 - Forbidden: Access is denied") or open-uri raises.
|
|
402
|
+
# The old "w+" download truncated firstbase.csv to zero bytes on any such
|
|
403
|
+
# failure, silently dropping every NONPHARMA article. Reject non-CSV bodies
|
|
404
|
+
# so the caller can keep the previous good firstbase.csv instead.
|
|
405
|
+
def firstbase_csv?(text)
|
|
406
|
+
return false if text.nil?
|
|
407
|
+
head = text[0, 512].to_s.strip.downcase
|
|
408
|
+
return false if head.empty?
|
|
409
|
+
return false if head.start_with?("<!doctype", "<html", "<?xml")
|
|
410
|
+
return false if head.include?("403 - forbidden") || head.include?("access is denied")
|
|
411
|
+
true
|
|
412
|
+
end
|
|
413
|
+
|
|
399
414
|
def download
|
|
400
415
|
@file2save = File.join(DOWNLOADS, "firstbase.csv")
|
|
401
416
|
report_download(@url, @file2save)
|
|
402
|
-
|
|
403
|
-
|
|
417
|
+
# Price-increment / Artikelstamm runs (--skip-download) reuse the cached
|
|
418
|
+
# firstbase.csv. Do NOT skip merely because the file exists: the nightly
|
|
419
|
+
# deploy seeds a last-good copy so a GS1 outage does not blank the feed,
|
|
420
|
+
# and we still want a fresh download attempt on the first (downloading)
|
|
421
|
+
# build so a recovered GS1 refreshes the data.
|
|
422
|
+
if Oddb2xml.skip_download? && File.size?(@file2save)
|
|
423
|
+
Oddb2xml.log "FirstbaseDownloader: --skip-download, reusing cached #{@file2save} (#{File.size(@file2save)} bytes)"
|
|
404
424
|
return File.expand_path(@file2save)
|
|
425
|
+
end
|
|
426
|
+
begin
|
|
427
|
+
data = Oddb2xml.uri_open(@url).read
|
|
428
|
+
if firstbase_csv?(data)
|
|
429
|
+
File.write(@file2save, data)
|
|
430
|
+
Oddb2xml.log "FirstbaseDownloader: fetched fresh firstbase.csv (#{data.bytesize} bytes)"
|
|
431
|
+
elsif File.size?(@file2save)
|
|
432
|
+
Oddb2xml.log "FirstbaseDownloader: GS1 returned no CSV (#{data.to_s.bytesize} bytes); keeping existing #{@file2save} (#{File.size(@file2save)} bytes)"
|
|
433
|
+
else
|
|
434
|
+
Oddb2xml.log "FirstbaseDownloader: GS1 returned no CSV and there is no cached firstbase.csv to fall back to"
|
|
435
|
+
end
|
|
405
436
|
rescue Timeout::Error, Errno::ETIMEDOUT
|
|
406
437
|
retrievable? ? retry : raise
|
|
438
|
+
rescue => error
|
|
439
|
+
# 403 / blocked / unreachable: keep any existing firstbase.csv (e.g. the
|
|
440
|
+
# last-good copy the deploy script seeds) rather than truncating it.
|
|
441
|
+
if File.size?(@file2save)
|
|
442
|
+
Oddb2xml.log "FirstbaseDownloader: download failed (#{error.class}: #{error}); keeping existing #{@file2save} (#{File.size(@file2save)} bytes)"
|
|
443
|
+
else
|
|
444
|
+
Oddb2xml.log "FirstbaseDownloader: download failed (#{error.class}: #{error}) and no cached firstbase.csv to fall back to"
|
|
445
|
+
end
|
|
407
446
|
ensure
|
|
408
447
|
Oddb2xml.download_finished(@file2save, false)
|
|
409
448
|
end
|
data/lib/oddb2xml/version.rb
CHANGED
data/scripts/generate_index_html.sh
CHANGED
|
@@ -35,7 +35,13 @@ pharma="—"
|
|
|
35
35
|
[[ -f "$ARTICLE_XML" ]] && pharma=$(grep -c '<SMNO>' "$ARTICLE_XML" || true)
|
|
36
36
|
|
|
37
37
|
nonpharma="—"
|
|
38
|
-
|
|
38
|
+
# Only count when the CSV actually has data rows. An empty firstbase.csv (the
|
|
39
|
+
# GS1 firstbase dump upstream returns 403/empty from time to time) would make a
|
|
40
|
+
# naive `rows - 1` render "-1"; keep the "—" fallback instead.
|
|
41
|
+
if [[ -s "$FIRSTBASE_CSV" ]]; then
|
|
42
|
+
fb_rows=$(wc -l < "$FIRSTBASE_CSV")
|
|
43
|
+
(( fb_rows > 1 )) && nonpharma=$(( fb_rows - 1 ))
|
|
44
|
+
fi
|
|
39
45
|
|
|
40
46
|
stand=$(date '+%d.%m.%Y %H:%M')
|
|
41
47
|
|
|
@@ -150,7 +156,8 @@ cat > "$tmp" <<HTML
|
|
|
150
156
|
|
|
151
157
|
<h2>Elexis Artikelstamm</h2>
|
|
152
158
|
<ul>
|
|
153
|
-
<li><a href="artikelstamm/">artikelstamm/</a> <span class="desc">— Elexis Artikelstamm v6 (mit BAG-Indikationscodes) und Legacy-v5, je als XML und CSV, täglich aktualisiert</span></li>
|
|
159
|
+
<li><a href="artikelstamm/">artikelstamm/</a> <span class="desc">— Elexis Artikelstamm v6 (mit BAG-Indikationscodes) und Legacy-v5, je als XML und CSV, täglich aktualisiert (erzeugt mit oddb2xml)</span></li>
|
|
160
|
+
<li><a href="artikelstamm/rust2xml/">artikelstamm/rust2xml/</a> <span class="desc">— derselbe Artikelstamm v6 + v5, erzeugt mit <a href="https://github.com/zdavatz/rust2xml">rust2xml</a> (Rust-Port), täglich um 03:00 aus denselben Live-Quellen</span></li>
|
|
154
161
|
</ul>
|
|
155
162
|
|
|
156
163
|
<h2>aips2sqlite — Fachinformationen & AmiKo-Datenbanken</h2>
|
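The NONPHARMA guard in the first hunk above can be exercised in isolation. This is a minimal sketch under the assumption that the logic is wrapped in a function; `nonpharma_count` is a hypothetical helper name and the sample CSV files are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the guard: prints the data-row count of a CSV,
# or the "—" placeholder when the file is missing, empty, or header-only.
nonpharma_count() {
  local csv="$1" rows count="—"
  if [[ -s "$csv" ]]; then
    rows=$(wc -l < "$csv")
    (( rows > 1 )) && count=$(( rows - 1 ))
  fi
  printf '%s\n' "$count"
}

tmp=$(mktemp -d)
: > "$tmp/empty.csv"                                  # what a 403 used to leave behind
printf 'gtin,name\n' > "$tmp/header_only.csv"         # header, no data rows
printf 'gtin,name\n1,a\n2,b\n' > "$tmp/two_rows.csv"  # header plus two rows

nonpharma_count "$tmp/empty.csv"        # prints —
nonpharma_count "$tmp/header_only.csv"  # prints —
nonpharma_count "$tmp/two_rows.csv"     # prints 2
```

The `-s` test covers the missing and zero-byte cases, and the `rows > 1` check covers a file that holds only a header line.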
data/scripts/run_oddb2xml.sh
CHANGED
|
@@ -81,6 +81,41 @@ rm -rf "$BUILD_DIR"
|
|
|
81
81
|
mkdir -p "$BUILD_DIR"
|
|
82
82
|
cd "$BUILD_DIR"
|
|
83
83
|
|
|
84
|
+
# 3. Seed the ZurRose transfer.zip from the local get_transfer mirror.
|
|
85
|
+
# get_transfer.sh (crontab 00:30) downloads transfer.dat straight from
|
|
86
|
+
# zurrose.ch on THIS host and uploads the zip to pillbox.oddb.org — so the
|
|
87
|
+
# pillbox HTTP fetch is a needless detour back to our own file. Placing the
|
|
88
|
+
# zip in downloads/ makes oddb2xml's skip_download reuse it and the build no
|
|
89
|
+
# longer depends on pillbox being up (2026-07-02: pillbox refused connections
|
|
90
|
+
# during the 01:00 run and all three retries died, killing the whole nightly
|
|
91
|
+
# build). If the seed file is missing, oddb2xml falls back to the normal
|
|
92
|
+
# pillbox download as before.
|
|
93
|
+
GET_TRANSFER_ZIP="${GET_TRANSFER_ZIP:-/home/zdavatz/software/get_transfer/TRANSFER.ZIP}"
|
|
94
|
+
if [[ -s "$GET_TRANSFER_ZIP" ]]; then
|
|
95
|
+
mkdir -p "$BUILD_DIR/downloads"
|
|
96
|
+
cp -p "$GET_TRANSFER_ZIP" "$BUILD_DIR/downloads/transfer.zip"
|
|
97
|
+
log "Seeded ZurRose transfer.zip from $GET_TRANSFER_ZIP ($(date -r "$GET_TRANSFER_ZIP" '+%Y-%m-%d %H:%M'), no pillbox fetch needed)"
|
|
98
|
+
else
|
|
99
|
+
log "WARNING: $GET_TRANSFER_ZIP missing - falling back to pillbox.oddb.org download"
|
|
100
|
+
fi
|
|
101
|
+
|
|
102
|
+
# 3b. Firstbase (GS1 NONPHARMA) fallback. The GS1 GetFirstbaseHealthcare
|
|
103
|
+
# endpoint has been answering "403 - Forbidden", which blanked firstbase.csv and
|
|
104
|
+
# dropped every NONPHARMA article from the -b feed. Keep the last successful
|
|
105
|
+
# firstbase.csv in a persistent cache OUTSIDE $BUILD_DIR (it survives the nightly
|
|
106
|
+
# `rm -rf`) and seed it into downloads/ so the gem's FirstbaseDownloader falls
|
|
107
|
+
# back to yesterday's file when today's download fails. A recovered GS1 still
|
|
108
|
+
# refreshes the data: the first (downloading) build always retries the live
|
|
109
|
+
# fetch and only keeps the seed when that fetch yields no CSV.
|
|
110
|
+
FIRSTBASE_CACHE="${FIRSTBASE_CACHE:-${OUT_DIR%/}-state/firstbase.csv}"
|
|
111
|
+
if [[ -s "$FIRSTBASE_CACHE" ]]; then
|
|
112
|
+
mkdir -p "$BUILD_DIR/downloads"
|
|
113
|
+
cp -p "$FIRSTBASE_CACHE" "$BUILD_DIR/downloads/firstbase.csv"
|
|
114
|
+
log "Seeded firstbase.csv from last-good cache $FIRSTBASE_CACHE ($(($(wc -l < "$FIRSTBASE_CACHE") - 1)) rows, $(date -r "$FIRSTBASE_CACHE" '+%Y-%m-%d %H:%M'))"
|
|
115
|
+
else
|
|
116
|
+
log "No firstbase last-good cache at $FIRSTBASE_CACHE yet - relying on live GS1 download"
|
|
117
|
+
fi
|
|
118
|
+
|
|
84
119
|
first=1
|
|
85
120
|
|
|
86
121
|
# build_one <increment-percent|""> <destination-subdir>
|
|
@@ -132,17 +167,43 @@ build_artikelstamm() {
|
|
|
132
167
|
shopt -u nullglob
|
|
133
168
|
[[ ${#out[@]} -ge 1 ]] || { log "ERROR: no artikelstamm output produced"; exit 1; }
|
|
134
169
|
|
|
135
|
-
rm -rf "$dest"
|
|
136
170
|
mkdir -p "$dest"
|
|
137
|
-
|
|
171
|
+
# Remove only oddb2xml's own top-level files; keep sub-directories such as
|
|
172
|
+
# rust2xml/ (published independently by rust2xml's own cron at 03:00) intact.
|
|
173
|
+
# A plain `rm -rf "$dest"` used to wipe that sibling output every night.
|
|
174
|
+
rm -f "$dest"/artikelstamm_*.xml "$dest"/artikelstamm_*.csv
|
|
175
|
+
# Publish under date-less, stable names so the download URLs never change:
|
|
176
|
+
# artikelstamm_01072026_v6.xml -> artikelstamm_v6.xml (same for _v5 / .csv).
|
|
177
|
+
local f base
|
|
178
|
+
for f in "${out[@]}"; do
|
|
179
|
+
base="$(basename "$f" | sed -E 's/_[0-9]{8}_/_/')"
|
|
180
|
+
cp -p "$f" "$dest/$base"
|
|
181
|
+
done
|
|
138
182
|
log "Staged ${#out[@]} file(s) to $dest"
|
|
139
183
|
}
|
|
140
184
|
|
|
185
|
+
# Build order: default first (it downloads the shared sources), then the
|
|
186
|
+
# Artikelstamm right after so it is published early, then the price increments.
|
|
187
|
+
build_one "" "default" # first run: downloads sources, no increment
|
|
188
|
+
|
|
189
|
+
# Refresh the last-good firstbase.csv cache after the downloading build. When
|
|
190
|
+
# GS1 answered, downloads/firstbase.csv now holds fresh data; when it 403'd, the
|
|
191
|
+
# gem kept the seeded copy - either way a non-empty file is worth caching so the
|
|
192
|
+
# next run can fall back to it. An empty file means both today's download AND the
|
|
193
|
+
# seed were missing, so leave the previous cache untouched.
|
|
194
|
+
FIRSTBASE_LIVE="$BUILD_DIR/downloads/firstbase.csv"
|
|
195
|
+
if [[ -s "$FIRSTBASE_LIVE" ]]; then
|
|
196
|
+
mkdir -p "$(dirname "$FIRSTBASE_CACHE")"
|
|
197
|
+
cp -p "$FIRSTBASE_LIVE" "$FIRSTBASE_CACHE"
|
|
198
|
+
log "Cached firstbase.csv as last-good ($(($(wc -l < "$FIRSTBASE_LIVE") - 1)) rows) -> $FIRSTBASE_CACHE"
|
|
199
|
+
else
|
|
200
|
+
log "WARNING: firstbase.csv is empty after the build (GS1 403 and no cache) - NONPHARMA missing this run"
|
|
201
|
+
fi
|
|
202
|
+
|
|
203
|
+
build_artikelstamm # Elexis Artikelstamm (v6 + legacy v5)
|
|
141
204
|
for inc in $INCREMENTS; do
|
|
142
|
-
build_one "$inc" "$inc"
|
|
205
|
+
build_one "$inc" "$inc" # price increments re-use the cached downloads/
|
|
143
206
|
done
|
|
144
|
-
build_one "" "default" # final run with no increment
|
|
145
|
-
build_artikelstamm # Elexis Artikelstamm (v6 + legacy v5)
|
|
146
207
|
|
|
147
208
|
# 2b. Refresh the download landing page with the live PHARMA/NONPHARMA counts
|
|
148
209
|
# (PHARMA from default/oddb_article.xml, NONPHARMA from the GS1 firstbase CSV).
|
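The renaming step in `build_artikelstamm` above strips the eight-digit date from each output file name so the published URLs stay stable. A small sketch of that substitution follows; `stable_name` is a hypothetical helper, not a function in the script, and the `/tmp` paths are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop the _DDMMYYYY_ date segment from a file name,
# using the same basename + sed expression as the script.
stable_name() {
  basename "$1" | sed -E 's/_[0-9]{8}_/_/'
}

stable_name /tmp/artikelstamm_01072026_v6.xml   # prints artikelstamm_v6.xml
stable_name /tmp/artikelstamm_01072026_v5.csv   # prints artikelstamm_v5.csv
stable_name /tmp/artikelstamm_v6.xml            # no date segment: unchanged
```

Because the pattern requires exactly eight digits between two underscores, names that are already date-less pass through untouched.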