PyPI - memories-crawl - Versions diffs - 0.2.0__tar.gz - Mend

memories-crawl 0.2.0__tar.gz

Files changed (29)

memories_crawl-0.2.0/.claude/memory/MEMORY.md +4 -0
memories_crawl-0.2.0/.claude/memory/project_memories_crawl.md +16 -0
memories_crawl-0.2.0/.claude/memory/project_overijssel_blocker.md +45 -0
memories_crawl-0.2.0/.claude/settings.local.json +12 -0
memories_crawl-0.2.0/.github/workflows/ci.yml +44 -0
memories_crawl-0.2.0/.github/workflows/publish.yml +31 -0
memories_crawl-0.2.0/.gitignore +9 -0
memories_crawl-0.2.0/.python-version +1 -0
memories_crawl-0.2.0/CLAUDE.md +376 -0
memories_crawl-0.2.0/GUIDE.md +64 -0
memories_crawl-0.2.0/PKG-INFO +446 -0
memories_crawl-0.2.0/README.md +422 -0
memories_crawl-0.2.0/pyproject.toml +57 -0
memories_crawl-0.2.0/src/memories_crawl/__init__.py +0 -0
memories_crawl-0.2.0/src/memories_crawl/__main__.py +5 -0
memories_crawl-0.2.0/src/memories_crawl/bhic.py +386 -0
memories_crawl-0.2.0/src/memories_crawl/cli.py +176 -0
memories_crawl-0.2.0/src/memories_crawl/drentsarchief.py +248 -0
memories_crawl-0.2.0/src/memories_crawl/friesland.py +350 -0
memories_crawl-0.2.0/src/memories_crawl/gelderland.py +664 -0
memories_crawl-0.2.0/src/memories_crawl/limburg.py +556 -0
memories_crawl-0.2.0/src/memories_crawl/nationaalarchief.py +367 -0
memories_crawl-0.2.0/src/memories_crawl/noordholland.py +663 -0
memories_crawl-0.2.0/src/memories_crawl/overijssel.py +375 -0
memories_crawl-0.2.0/src/memories_crawl/utrechtsarchief.py +598 -0
memories_crawl-0.2.0/src/memories_crawl/zeeland.py +658 -0
memories_crawl-0.2.0/tests/test_smoke.py +65 -0
memories_crawl-0.2.0/uv.lock +318 -0
memories_crawl-0.2.0/websites.md +203 -0

memories_crawl-0.2.0/.claude/memory/MEMORY.md ADDED

@@ -0,0 +1,4 @@
+# Memory Index
+- [Project: Memories van Successie pipeline](project_memories_crawl.md) — all 8 archives complete including Overijssel (implemented April 2026)
+- [Overijssel MAIS token extraction — SOLVED](project_overijssel_blocker.md) — complete kantoor minr map, per-page token approach, corrected image URL format

memories_crawl-0.2.0/.claude/memory/project_memories_crawl.md ADDED

@@ -0,0 +1,16 @@
+---
+name: Memories van Successie pipeline
+description: Python pipeline to download Dutch succession registers from 8 regional archives; all archives implemented
+type: project
+---
+Downloads all *Memories van Successie* (Dutch inheritance registers, 1806–1927) from 8 Dutch regional archives. Scans are saved with `metadata.json` sidecars.
+**Status** (as of 2026-04-14):
+- ✅ Open Archieven (5 archives: BHIC, Zeeuws Archief, HUA, Gelders, NHA) — 3-step pipeline via OAI-PMH XML dumps
+- ✅ Nationaal Archief (Zuid-Holland, access 3.06.05) — Drupal viewer scrape
+- ✅ Drents Archief (Drenthe) — Memorix REST API
+- ✅ Overijssel (HCO) — Playwright-based MAIS token extraction, implemented April 2026
+**Why:** Complete download for genealogical/historical research purposes.
+**How to apply:** `python/overijssel.py` is fully implemented. Run with `uv run python main.py overijssel`. Requires `playwright install chromium` after `uv sync`.

memories_crawl-0.2.0/.claude/memory/project_overijssel_blocker.md ADDED

@@ -0,0 +1,45 @@
+---
+name: Overijssel MAIS token extraction — SOLVED
+description: How MAIS per-page auth tokens are extracted and what the complete kantoor minr mapping is
+type: project
+---
+The Historisch Centrum Overijssel (HCO) uses the MAIS Internet viewer system. Images at `preserve2.archieven.nl` require three per-page auth tokens: `miahd`, `rdt`, `open`. These are **per-page**: each scan page has its own unique `miahd` and `open`.
+**SOLVED April 2026** — `python/overijssel.py` is fully implemented.
+## How token extraction works
+1. Navigate (Playwright/Chromium) to the MAIS inv3 page for the kantoor minr:
+   `https://collectieoverijssel.nl/collectie/archieven/?mivast=20&mizig=210&miadt=141&miaet=1&micode=0136.4&minr={minr}&milang=nl&miview=inv3`
+2. The page auto-loads PHPSESSID + mi_sessid cookies (WordPress + MAIS).
+3. Collect all `a[onclick*="stk3"]` links — each is one invnr volume.
+4. For each: call `mi_inv3_toggle_stk(args)` via `page.evaluate()`.
+5. Wait ~1.5s, harvest `img[src*="/fonc-hco/"]` from the DOM.
+6. Parse `invnr`, `page`, `miahd`, `rdt`, `open` from each src URL.
+## Image URL format (corrected — original stub was wrong)
+```
+https://preserve2.archieven.nl/mi-20/fonc-hco/0136.4/{invnr}/
+    NL-ZlHCO_0136.4_{invnr}_{page:04d}.jpg
+    ?miadt=141&miahd={miahd}&mivast=20&rdt={rdt}&open={open}
+```
+Note the `{invnr}/` subdirectory — the original stub was missing this.
+## Complete KANTOOR_MINR mapping (verified April 2026)
+```python
+KANTOOR_MINR = {
+    "Almelo":     2227676,
+    "Deventer":   2227950,
+    "Enschede":   2228207,
+    "Goor":       2228335,   # original stub had wrong names: Hardenberg, Oldenzaal, Overige
+    "Kampen":     2228502,
+    "Ommen":      2228649,
+    "Raalte":     2228752,
+    "Steenwijk":  2228889,
+    "Vollenhove": 2228980,
+    "Zwolle":     2229046,
+}
+```
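
Step 6 of the token extraction above (parsing `invnr`, `page`, `miahd`, `rdt`, and `open` out of each harvested `src`) can be sketched with the standard library alone. This is a minimal illustration that assumes the URL shape shown in the note. The function name `parse_scan_url` and the token values in the example are hypothetical and not taken from the package.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Filename shape assumed from the URL template in the note:
# NL-ZlHCO_0136.4_{invnr}_{page:04d}.jpg
_FILENAME_RE = re.compile(r"NL-ZlHCO_0136\.4_(?P<invnr>[^_/]+)_(?P<page>\d+)\.jpg$")


def parse_scan_url(src: str) -> dict:
    """Split one harvested img src into invnr, page, and the three tokens."""
    parts = urlsplit(src)
    match = _FILENAME_RE.search(parts.path)
    if match is None:
        raise ValueError(f"unexpected image path: {parts.path}")
    query = parse_qs(parts.query)  # values arrive as lists, one entry per key here
    return {
        "invnr": match["invnr"],
        "page": int(match["page"]),
        "miahd": query["miahd"][0],
        "rdt": query["rdt"][0],
        "open": query["open"][0],
    }


# Made-up token values, for illustration only.
example = (
    "https://preserve2.archieven.nl/mi-20/fonc-hco/0136.4/12/"
    "NL-ZlHCO_0136.4_12_0003.jpg"
    "?miadt=141&miahd=111&mivast=20&rdt=222&open=abc"
)
print(parse_scan_url(example))
```

A missing token raises `KeyError`, which is probably the desired behaviour here, since a URL without all three tokens cannot be downloaded.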

memories_crawl-0.2.0/.claude/settings.local.json ADDED

@@ -0,0 +1,12 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(git:*)",
+      "mcp__playwright__browser_network_requests",
+      "mcp__playwright__browser_evaluate",
+      "mcp__playwright__browser_wait_for",
+      "mcp__playwright__browser_navigate",
+      "WebFetch(domain:www.bhic.nl)"
+    ]
+  }
+}

memories_crawl-0.2.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,44 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.12", "3.14"]
+      fail-fast: false
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+      - name: Install dependencies
+        run: uv sync --all-groups
+      - name: Lint with ruff
+        run: uv run ruff check
+      - name: Check formatting with ruff
+        run: uv run ruff format --check
+      - name: Run smoke tests
+        run: uv run pytest -v --tb=short
+      - name: Build package
+        run: uv build

memories_crawl-0.2.0/.github/workflows/publish.yml ADDED

@@ -0,0 +1,31 @@
+name: Publish to PyPI
+on:
+  push:
+    tags:
+      - "v*"
+  workflow_dispatch:
+permissions:
+  id-token: write
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+          enable-cache: true
+      - name: Build
+        run: uv build
+      - name: Publish
+        env:
+          UV_PUBLISH_TOKEN: ${{ secrets.PYPI_TOKEN }}
+        run: uv publish

memories_crawl-0.2.0/.gitignore ADDED

@@ -0,0 +1,9 @@
+.env
+__pycache__/
+.playwright-mcp/
+scans/
+dumps/
+*.csv
+*.txt
+dist/
+.venv/

memories_crawl-0.2.0/.python-version ADDED

@@ -0,0 +1 @@
+3.14

memories_crawl-0.2.0/CLAUDE.md ADDED

@@ -0,0 +1,376 @@
+# CLAUDE.md – Memories van Successie Pipeline
+## What this project does
+Downloads all surviving *Memories van Successie* (Dutch succession/inheritance registers, 1806–1927) from ten regional Dutch archives. Each scan is saved alongside a `metadata.json` sidecar.
+## How to run
+```bash
+uv run memories-crawl friesland          # Friesland (Tresoar / AlleFriezen, Memorix API)
+uv run memories-crawl nationaalarchief   # Zuid-Holland (Nationaal Archief 3.06.05)
+uv run memories-crawl drentsarchief      # Drenthe (Memorix API)
+uv run memories-crawl bhic               # Noord-Brabant (BHIC Memorix API)
+uv run memories-crawl overijssel         # Overijssel (HCO) – requires Playwright
+uv run memories-crawl utrechtsarchief    # Utrecht (Het Utrechts Archief) – requires Playwright
+uv run memories-crawl limburg            # Limburg (RHCL, archieven.nl MAIS) – requires Playwright
+uv run memories-crawl noordholland       # Noord-Holland (Noord-Hollands Archief) – requires Playwright
+uv run memories-crawl zeeland            # Zeeland (Zeeuws Archief) – requires Playwright
+uv run memories-crawl gelderland         # Gelderland (Gelders Archief) – requires Playwright
+uv run memories-crawl all
+```
+## File map
+| File | Purpose |
+|---|---|
+| `src/memories_crawl/cli.py` | CLI dispatcher |
+| `src/memories_crawl/nationaalarchief.py` | Zuid-Holland: scrape viewer pages, download via UUID |
+| `src/memories_crawl/drentsarchief.py` | Drenthe: Memorix REST API, deed→asset chain |
+| `src/memories_crawl/bhic.py` | Noord-Brabant (BHIC): Memorix REST API, register→asset chain |
+| `src/memories_crawl/overijssel.py` | Overijssel: Playwright-based MAIS token extraction |
+| `src/memories_crawl/utrechtsarchief.py` | Utrecht: Playwright-based MAIS stk3 inline strip extraction |
+| `src/memories_crawl/limburg.py` | Limburg (RHCL): Playwright on archieven.nl, strip Volgende-step |
+| `src/memories_crawl/noordholland.py` | Noord-Holland: Playwright-based MAIS stk3 inline strip extraction |
+| `src/memories_crawl/zeeland.py` | Zeeland: Playwright-based MAIS hybrid (inv3 discovery + inv2 strip harvest) |
+| `src/memories_crawl/friesland.py` | Friesland: Tresoar / AlleFriezen Memorix REST API, register→deed→person chain |
+| `src/memories_crawl/gelderland.py` | Gelderland: Playwright-based MAIS, one micode per kantoor (21 codes), strip auto-loads on inv2 minr |
+## Exclusion rule
+**Always exclude Tafel V-bis.** In all parsers and filters, skip any record whose SourceType contains "tafel" or "v-bis" (case-insensitive). The Nationaal Archief Tafel V-bis items are in a different inventory section (outside 2276–2357) and are excluded by range.
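
The exclusion rule can be illustrated with a small predicate. This is a sketch only: the name `is_excluded` and the sample records are hypothetical, and the package's own filter (the notes mention an `_is_tafel()` helper for BHIC) may differ in detail.

```python
from typing import Optional

# Substrings that mark a record as Tafel V-bis, per the rule above.
EXCLUDED_MARKERS = ("tafel", "v-bis")


def is_excluded(source_type: Optional[str]) -> bool:
    """Return True when SourceType contains "tafel" or "v-bis", ignoring case."""
    text = (source_type or "").casefold()
    return any(marker in text for marker in EXCLUDED_MARKERS)


# Made-up records, for illustration only.
records = [
    {"SourceType": "Memories van Successie"},
    {"SourceType": "Tafel V-bis"},
    {"SourceType": "TAFEL v-BIS"},
]
kept = [r for r in records if not is_excluded(r.get("SourceType"))]
print(kept)
```

Treating a missing `SourceType` as "not excluded" is a design choice of this sketch; a stricter pipeline might prefer to log such records instead.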
+---
+## Overijssel (HCO) – MAIS token extraction
+First-time setup: `uv sync && uv run playwright install chromium`
+The HCO uses a MAIS Internet viewer. Each scan page requires unique per-page tokens
+(`miahd`, `rdt`, `open`). The implementation in `src/memories_crawl/overijssel.py`:
+1. Opens the MAIS inv3 page in headless Chromium to establish the session.
+2. Clicks each invnr-item stk3 link via `mi_inv3_toggle_stk(...)`.
+3. Harvests `img[src*="/fonc-hco/"]` from the DOM to get per-page tokens.
+**Image URL format:**
+```
+https://preserve2.archieven.nl/mi-20/fonc-hco/0136.4/{invnr}/
+    NL-ZlHCO_0136.4_{invnr}_{page:04d}.jpg
+    ?miadt=141&miahd={miahd}&mivast=20&rdt={rdt}&open={token}
+```
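
For illustration, the URL template above can be filled in from one set of harvested tokens as follows. This is a sketch under the assumption that the tokens are held as plain, already-decoded strings. `build_scan_url` is a hypothetical name and the token values are made up.

```python
from urllib.parse import urlencode

# Fixed prefix taken from the template above.
BASE = "https://preserve2.archieven.nl/mi-20/fonc-hco/0136.4"


def build_scan_url(invnr: str, page: int, miahd: str, rdt: str, token: str) -> str:
    """Assemble one full-size image URL for an Overijssel scan page."""
    filename = f"NL-ZlHCO_0136.4_{invnr}_{page:04d}.jpg"
    # Parameter order mirrors the template; urlencode percent-encodes the values.
    query = urlencode(
        {"miadt": 141, "miahd": miahd, "mivast": 20, "rdt": rdt, "open": token}
    )
    return f"{BASE}/{invnr}/{filename}?{query}"


print(build_scan_url("12", 3, "111", "222", "abc"))
```

If the tokens are stored exactly as they appeared in the harvested `src` (still percent-encoded), passing them through `urlencode` would encode them twice, so in that case the query string should be concatenated as-is instead.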
+**Kantoor minr values** (verified April 2026):
+| Kantoor    | minr    |
+|------------|---------|
+| Almelo     | 2227676 |
+| Deventer   | 2227950 |
+| Enschede   | 2228207 |
+| Goor       | 2228335 |
+| Kampen     | 2228502 |
+| Ommen      | 2228649 |
+| Raalte     | 2228752 |
+| Steenwijk  | 2228889 |
+| Vollenhove | 2228980 |
+| Zwolle     | 2229046 |
+---
+## Pipeline status (verified 2026-04-24)
+Each pipeline's API or server access was live-tested against the real services; the End-to-end column records whether a full download run has also been verified.
+| Pipeline | API/Server | End-to-end | Notes |
+|---|---|---|---|
+| **friesland** | ✅ | ⚠️ not yet tested | Tresoar / AlleFriezen Memorix REST API. 1,107 registers, ~238k persons. Deed-level assets with .jp2 downloads. Person→deed join via deed_id. Output: scans/friesland/{kantoor}/{invnr}/{person}/. |
+| **nationaalarchief** | ✅ | ✅ | 70 scans downloaded from invnr 2276 in 60s (174 MB). EAD XML parses correctly, drupal-settings-json extraction works, `service.archief.nl` download works. |
+| **drentsarchief** | ✅ | ⚠️ slow start | API returns ~106k deeds. Pipeline must paginate ~1064 pages to collect all deed IDs **before** any download begins (~5 min). Once collection finishes, downloads work (8.3 MB/scan tested). |
+| **overijssel** | ✅ | ⚠️ slow first run | Playwright + Chromium work. Almelo has 256 stk3 items → ~1825 pages of tokens; collecting tokens takes ~6 min per kantoor. Token results are cached in `scans/overijssel/tokens_minr_{minr}.json` — reruns skip Playwright entirely. |
+| **utrechtsarchief** | ✅ | ⚠️ slow first run | Playwright + Chromium. Uses stk3 inline toggle (same approach as Overijssel). Amersfoort verified: 66,615 pages from 211 invnrs across 2 subsections (~12 min harvest). Token results cached per subsection — reruns skip Playwright. 11 kantoren configured. |
+| **limburg** | ✅ | ✅ verified | archieven.nl MAIS (miadt=38, mivast=0). Two codes: 07.D03 (1818-1900, 111 digitized of 1,314, ~104k scans, by place) and 07.D08 (1901-1927, 42 digitized of 460, ~7k scans, by kantoor). End-to-end smoke-tested: invnr 1 (Amby) → 527 pages; invnr 491 (Gennep) → 207 pages. Inventory + tokens cached per code/invnr; reruns skip Playwright. Image format is `format=large` PNG (714×1024); see module docstring for trade-off vs. IIPSrv full-res JP2 path. |
+| **noordholland** | ✅ | ⚠️ not yet tested | noord-hollandsarchief.nl MAIS (miadt=236, mivast=236, micode=178). Uses stk3 inline toggle (same approach as Overijssel/Utrecht). Kantoor sections discovered dynamically from inv2 tree. Tokens cached per section minr; reruns skip Playwright. Image server: preserve-nha.archieven.nl/mi-0/fonc-nha/178/. |
+| **zeeland** | ✅ | ✅ verified | Zeeuws Archief MAIS (miadt=239, mivast=239, micode=398). Hybrid approach: inv3 tree for discovery (kantoor→sub-section→invnr with h_scan markers), inv2 minr pages for strip harvesting (auto-loads strip, force-load all chunks via mi_strip_store.populate()). Goes verified: 990 digitized invnrs of 1,109, invnr 1 → 327 pages, invnr 2 → 373 pages. Image server: preserve-zaf.archieven.nl/mi-239/fonc-zaf/398/. Downloads at `format=large` PNG (673×1024). Filenames include segment slug for uniqueness (e.g. `1-1_0001.jpg`). Tokens cached per kantoor in `tokens_minr_{minr}.json`. |
+| **gelderland** | ✅ | ✅ verified | Gelders Archief MAIS (miadt=37, mivast=37). 21 kantoren, each with its own micode (0021–0037, 0092, 0221–0223). Per-kantoor pipeline: inv2 → pick "Register IV" minr (filter "Tafel VI / V-bis") → inv3 → swapinv-expand period sub-sections → collect leaf invnrs with `h_scan` markers and `^\d+\s` text. Per-invnr token harvest = navigate to inv2&minr=…, force-load strip via `mi_strip_store.populate()`, harvest `img[src*="fonc-gea"]`. Borculo (0022) verified end-to-end: 49 digitized invnrs, invnr 1 → 38 pages, full-size 1024×858 PNG (~570 KB/page). Image server: preserve2.archieven.nl/mi-37/fonc-gea/{code}/. Filenames `{invnr}-{page:04d}.jpg`. Inventory + tokens cached per code; reruns skip Playwright. |
+**Setup reminder**: Chromium must be installed with `uv run playwright install chromium` (not bare `playwright install chromium`).
+---
+## Technical notes
+### Nationaal Archief scan extraction
+Scans are in a `<script data-drupal-selector="drupal-settings-json">` JSON blob. Parse `settings["viewer"]["response"]["scans"]`. Each scan has `{"id": UUID, "label": "NL-HaNA_...", "default": {"url": "https://service.archief.nl/api/file/v1/default/{UUID}"}}`. Download via `default.url`.
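
A minimal sketch of that extraction using only the standard library is shown below. The class and function names are hypothetical, and the sample HTML (including the UUID and the label) is invented to match the shape described above; the real module may well use a different HTML parser.

```python
import json
from html.parser import HTMLParser


class DrupalSettingsParser(HTMLParser):
    """Collect the text of the drupal-settings-json <script> element."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        selector = dict(attrs).get("data-drupal-selector")
        if tag == "script" and selector == "drupal-settings-json":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_scans(html: str) -> list[dict]:
    """Return settings["viewer"]["response"]["scans"] from a viewer page."""
    parser = DrupalSettingsParser()
    parser.feed(html)
    parser.close()
    settings = json.loads("".join(parser.chunks))
    return settings["viewer"]["response"]["scans"]


# Invented sample page, for illustration only.
sample = """
<html><body>
<script type="application/json" data-drupal-selector="drupal-settings-json">
{"viewer": {"response": {"scans": [
  {"id": "00000000-0000-0000-0000-000000000001",
   "label": "NL-HaNA_example_0001",
   "default": {"url": "https://service.archief.nl/api/file/v1/default/00000000-0000-0000-0000-000000000001"}}
]}}}
</script>
</body></html>
"""
for scan in extract_scans(sample):
    print(scan["label"], scan["default"]["url"])
```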
+### Drents Archief API
+```
+Base: https://webservices.memorix.nl/genealogy
+Key:  a85387a2-fdb2-44d0-8209-3635e59c537e
+Person search: GET /person?q=*:*&fq=search_s_deed_type_title:"Successiememories"&rows=100&page=N
+Deed detail:   GET /deed/{deed_id}
+Full image:    asset[].download  (e.g. https://images.memorix.nl/dre/download/fullsize/{uuid}.jpg)
+```
+### BHIC (Noord-Brabant) API
+Same Memorix backend, **different tenant key**, and scans live at the **register**
+level (one register = one bound book), not at the deed level.
+```
+Base: https://webservices.memorix.nl/genealogy
+Key:  24c66d08-da4a-4d60-917f-5942681dcaa1
+Register list: GET /register?q=*:*&fq=search_s_type_title:"memorie van successie"&rows=100&page=N
+Assets:        GET /asset?fq=register_id:{register_id}&rows=100&page=N
+Deeds:         GET /deed?fq=register_id:{register_id}&rows=100&page=N
+Persons:       GET /person?fq=register_id:{register_id}&rows=100&page=N
+Full image:    asset[].download  (https://images.memorix.nl/bhic/download/fullsize/{file_id}.jpg)
+```
+1,896 registers total. Code prefixes are `036.03.01..19` (Memories van successie,
+kantoor X) plus `021.13` (Memories van successie Brabant). Tafel V-bis is not
+indexed at BHIC, but `_is_tafel()` filters defensively just in case.
+### Friesland (Tresoar / AlleFriezen) – Memorix REST API
+Tresoar's *Memories van Successie* are served via AlleFriezen, which runs the
+same Memorix Genealogy REST API as Drenthe and BHIC.
+```
+Base: https://webservices.memorix.nl/genealogy
+Key:  aa030ec4-12d0-4dc0-afaf-b65fd6128b39
+Tenant: frl
+Register list: GET /register?q=*:*&fq=search_s_type_title:"Memories van successie"&rows=100&page=N
+Deeds:         GET /deed?fq=register_id:{register_id}&rows=100&page=N
+Persons:       GET /person?fq=register_id:{register_id}&rows=100&page=N
+Full image:    asset[].download → https://tresoar-images.memorix.nl/frl/download/fullsize/{path}.jp2
+```
+1,107 registers total, ~238,576 persons. Entity types: `mvs` (register),
+`mvs_a` (deed/akte), `mvs_a_persoon` (person). One person per deed (the
+"overledene"). Deeds embed their asset references directly (`has_assets: "deed"`,
+`asset[].download`).
+Tafel V-bis is not present in the Tresoar collection (0 results).
+**Person metadata** includes `person_display_name`, `voornaam`, `tussenvoegsel`,
+`geslachtsnaam`, `patroniem`, `datum_overlijden`, `plaats` (overlijdensplaats),
+`plaats_wonen`, `geslacht`.
+Deed metadata includes `nummer` (aktenummer), `plaats`, `diversen`
+(free-text notes with filmnummer, estate details, family relations).
+**Image format**: JPEG 2000 (`.jp2`). No format conversion is done;
+convert with `magick mogrify -format jpg *.jp2` if needed.
+```
+Folder layout
+─────────────
+  scans/friesland/{kantoor}/{invnr}/{person_slug}/
+      {NNNN}.jp2           – sequentially numbered scan pages
+      metadata.json        – per-person info (name, date of death, …)
+```
+Kantoor is extracted from the register `naam` field (e.g. "Sneek" from
+"Memories kantoor Sneek").
+**Resume**: `friesland_progress.csv` tracks completed registers. Existing
+per-person directories (with `metadata.json`) are skipped on reruns.
+### Limburg (RHCL) – archieven.nl MAIS
+Two archive codes hold all Memories van Successie at RHCL:
+| Code   | Period         | Total invnrs | Digitized | Organised by |
+|--------|----------------|--------------|-----------|--------------|
+| 07.D03 | 1818-1900 (1905) | 1,314      | 111       | Plaats (place of death) |
+| 07.D08 | 1901-1927      | 460          | 42        | Kantoor      |
+07.D08 also contains a sibling section "Tafels 5bis" (minr 1014481) which is
+**excluded** per the project-wide Tafel V-bis rule. The scraper drills into
+07.D08's MvS-only sub-section (parent minr 1014062), so the tafel branch is
+never visited.
+```
+inv2 root:      https://www.archieven.nl/nl/zoeken
+                  ?mivast=0&mizig=210&miadt=38&micode={code}&miview=inv2
+per-invnr page: …same…&minr={minr}  (strip auto-loads)
+image URL:      https://preserve3.archieven.nl/mi-0/fonc-rhcl/{code}/{invnr}/
+                  NL-MtHCL_{code}_{invnr}_{page:04d}.jpg
+                  ?format=large&miadt=38&miahd={miahd}&mivast=0&rdt={rdt}&open={token}
+```
+Pagination quirks:
+- The root inv2 page renders only ~100 leaf nodes at a time, with a
+  ``Records N t/m M`` toggle per remaining batch driven by
+  ``mi_inv3_swapinv(...)``. The scraper clicks every batch in-page until none
+  remain.
+- The per-invnr strip exposes only 25 thumbnails initially; the rest are
+  loaded by clicking the ``.snext`` (Volgende) arrow. The scraper steps the
+  arrow until the ``.snavuit`` (disabled) class appears.
+Image format: ``format=large`` returns a 714×1024 PNG (~700 KB-1.2 MB per
+page). The archival 2090×3000 JPEG is only available via the IIPSrv zoomify
+tile server (``iipsrv12.fcgi?FIF=cache/fonc-rhcl/{hash}.jp2&CVT=jpeg``), but
+the ``{invnr,page} → JP2 hash`` map is only exposed inside each scan's
+embed-viewer HTML, so reaching full-res would require an extra viewer load
+per scan (~110 k loads). See module docstring for details.
+Caches:
+- ``scans/limburg/inventory_{code}.json``  – list of digitized invnrs
+- ``scans/limburg/tokens_{code}_{invnr}.json`` – per-page tokens for one register
+Both caches are sufficient for the download phase; rerunning skips Playwright
+entirely once they exist.
+### Noord-Holland (NHA) – noord-hollandsarchief.nl MAIS
+Archive 178 holds all *Memories van Successie* for the province of Noord-Holland.
+The inventory is organized by kantoor (tax office), discovered dynamically from
+the inv2 tree via Playwright.
+**Approach**: Same stk3 inline toggle pattern as Overijssel and Utrecht.
+```
+inv2 root:      https://noord-hollandsarchief.nl/bronnen/archieven
+                  ?mivast=236&mizig=210&miadt=236&micode=178&miview=inv2
+inv3 (kantoor): …same…&miaet=1&micode=178&minr={minr}&milang=nl&miview=inv3
+image URL:      https://preserve-nha.archieven.nl/mi-0/fonc-nha/178/{invnr}/
+                  NL-HlmNHA_178_{invnr}_{page:04d}.jpg
+                  ?miadt=236&miahd={miahd}&mivast=0&rdt={rdt}&open={token}
+```
+**Image format**: Remove `?format=thumb` from thumbnail URLs to get full-size.
+Note that the preserve URL uses `mivast=0` (not 236), same pattern as Limburg.
+**Caches**:
+- ``scans/noordholland/sections.json`` – discovered kantoor sections
+- ``scans/noordholland/tokens_{minr}.json`` – per-page tokens for one kantoor section
+- ``scans/noordholland/tokens_{minr}_partial.json`` – incremental save (crash-resilient)
+**Resume**: ``scans/noordholland/done.txt`` tracks completed kantoor sections.
+Partial token caches allow resuming interrupted harvest runs.
+### Zeeland (Zeeuws Archief) – MAIS token extraction
+First-time setup: ``uv sync && uv run playwright install chromium``
+The Zeeuws Archief runs its own MAIS instance on the zeeuwsarchief.nl domain. The
+scraper takes a **hybrid approach**:
+1. **Discovery** – Navigates to the inv3 tree view for each kantoor minr, expands
+   all sub-sections via swapinv clicks, then harvests inventarisnummer minr values
+   (and their texts) from stk3 onclick handlers. Digitized items are those whose
+   tree node carries an `h_scan.gif` marker. Tafel V-bis filtered by text.
+2. **Token harvest** – Navigates to each invnr's inv2 minr page. The strip viewer
+   auto-loads on this page. All strip chunks are force-loaded via
+   ``mi_strip_store.populate()``, then thumbnail ``<img>`` elements with
+   ``src*="fonc-zaf"`` are harvested from the DOM.
+3. **Download** – Thumbnails have ``?format=thumb``; replacing with ``?format=large``
+   yields 673×1024 PNG. The preserve server is ``preserve-zaf.archieven.nl/mi-239/``.
+**Image URL format:**
+```
+https://preserve-zaf.archieven.nl/mi-239/fonc-zaf/398/{invnr}/
+    NL-MdbZA_398_{invnr}_{slug}_{page:04d}.jpg
+    ?format=large&miadt=239&miahd={miahd}&mivast=239&rdt={rdt}&open={token}
+```
+Some images omit the ``{slug}_`` component (e.g. ``NL-MdbZA_398_1_0001.jpg``). The
+slug provides uniqueness when the same trailing page number appears in multiple
+scan segments within one register.
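
The optional ``{slug}_`` component can be handled with one regular expression. The sketch below assumes that neither the invnr nor the slug contains an underscore, which the notes do not confirm. `parse_scan_name` is a hypothetical helper, and the slug `1-1` in the example is borrowed from the local filename example in the status table, not from an observed server filename.

```python
import re
from typing import Optional

# NL-MdbZA_398_{invnr}_{slug}_{page:04d}.jpg, where "{slug}_" may be absent.
NAME_RE = re.compile(
    r"^NL-MdbZA_398_(?P<invnr>[^_]+)_(?:(?P<slug>[^_]+)_)?(?P<page>\d{4})\.jpg$"
)


def parse_scan_name(name: str) -> tuple[str, Optional[str], int]:
    """Return (invnr, slug or None, page) for one Zeeland image filename."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return match["invnr"], match["slug"], int(match["page"])


print(parse_scan_name("NL-MdbZA_398_1_0001.jpg"))
print(parse_scan_name("NL-MdbZA_398_1_1-1_0001.jpg"))
```

Keeping the slug as part of the key `(invnr, slug, page)` is what prevents two segments that both contain a page `0001` from overwriting each other.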
+**Kantoren** (9 total, discovered dynamically):
+| Kantoor     | minr      | Digitized invnrs | Total invnrs |
+|-------------|-----------|------------------|--------------|
+| Goes        | 33439946  | 990              | 1,109        |
+| Hulst       | 33439947  | TBD              | TBD          |
+| Colijnsplaat/Kortgene | 33439948 | TBD       | TBD          |
+| Middelburg  | 33439949  | TBD              | TBD          |
+| Oostburg    | 33439950  | TBD              | TBD          |
+| Tholen      | 33439951  | TBD              | TBD          |
+| Veere       | 33439952  | TBD              | TBD          |
+| Vlissingen  | 33439953  | TBD              | TBD          |
+| Zierikzee   | 33439954  | TBD              | TBD          |
+**Caches**:
+- ``scans/zeeland/kantoren.json`` – discovered kantoor entries with minr values
+- ``scans/zeeland/tokens_minr_{minr}.json`` – per-page tokens for one kantoor
+- ``scans/zeeland/tokens_minr_{minr}_partial.json`` – incremental save (crash-resilient)
+**Resume**: ``scans/zeeland/done.txt`` tracks completed kantoren.
+Partial token caches allow resuming interrupted harvest runs.
+**Smoke test** (2026-05-11): Goes invnr 1 → 327 pages, invnr 2 → 373 pages.
+Downloads at ``format=large`` PNG (673×1024, ~300KB–950KB per page).
+### Gelderland (Gelders Archief) – per-kantoor MAIS code
+First-time setup: ``uv sync && uv run playwright install chromium``
+Unlike every other MAIS instance in the project, the Gelders Archief gives
+**each kantoor its own archief-code**. Twenty-one kantoren are hardcoded in
+``KANTOREN`` (resolved 2026-05-11 from the kantoor permalinks listed at
+``https://www.geldersarchief.nl/informatie/zoekhulp/997-memories-van-successie``):
+| Kantoor     | Code  | Kantoor     | Code  | Kantoor     | Code  |
+|-------------|-------|-------------|-------|-------------|-------|
+| Arnhem      | 0021  | Elst        | 0028  | Tiel        | 0026  |
+| Apeldoorn   | 0092  | Groenlo     | 0029  | Wageningen  | 0036  |
+| Borculo     | 0022  | Harderwijk  | 0030  | Winterswijk | 0223  |
+| Culemborg   | 0023  | Hattem      | 0031  | Zaltbommel  | 0037  |
+| Doesburg    | 0024  | Lochem      | 0032  | Zevenaar    | 0221  |
+| Druten      | 0025  | Nijkerk     | 0033  | Zutphen     | 0222  |
+| Elburg      | 0027  | Nijmegen    | 0034  |             |       |
+|             |       | Terborg     | 0035  |             |       |
+Inside each kantoor's inv2 tree there are normally two top-level openinv items:
+1. *Register IV, akten van het recht van successie en van overgang …* – the
+   actual Memories van Successie.  Scraper keeps this.
+2. *Tafel VI, alfabetische index … en Tafel V-bis, …* – Tafel V-bis is
+   excluded per the project-wide rule, so we filter any top-level openinv
+   whose text contains "tafel", "v-bis", or "5bis".
+Below Register IV the records are grouped into periods of roughly five years ("Akten,
+1818-1825.", "Akten, 1826-1830.", …).  Each period eventually contains the
+leaf inventarisnummers ("1  1818", "140  1895 eerste kwartaal", …).
+Digitized leaves carry an ``h_scan.gif`` icon in their tree row; leaves whose
+text matches ``^\d+\s`` (a number followed by whitespace) and that carry the marker are kept.
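
The leaf filter described above can be sketched as follows. The node dictionaries follow the ``{invnr, text, minr, hasScan}`` shape of the inventory cache, but the `minr` values are made up and `digitized_leaves` is a hypothetical name.

```python
import re

# A leaf inventarisnummer row starts with a number followed by whitespace.
LEAF_RE = re.compile(r"^\d+\s")


def digitized_leaves(nodes: list[dict]) -> list[dict]:
    """Keep tree nodes that look like leaf invnrs and carry the scan marker."""
    return [
        node
        for node in nodes
        if node.get("hasScan") and LEAF_RE.match(node.get("text", ""))
    ]


# Made-up nodes, for illustration only.
nodes = [
    {"invnr": "1", "text": "1  1818", "minr": 101, "hasScan": True},
    {"invnr": "140", "text": "140  1895 eerste kwartaal", "minr": 102, "hasScan": False},
    {"invnr": "", "text": "Akten, 1818-1825.", "minr": 103, "hasScan": True},
]
print(digitized_leaves(nodes))
```

Only the first node survives: the second has no scan marker, and the third is a period heading rather than a leaf.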
+Scans are accessed by navigating to each leaf invnr's inv2 page; the
+thumbnail strip auto-loads (25 thumbs initially) and remaining chunks are
+force-loaded via ``mi_strip_store[…].populate()`` exactly as the Zeeland
+scraper does.
+**Image URL format:**
+```
+https://preserve2.archieven.nl/mi-37/fonc-gea/{code}/{invnr}/
+    {invnr}-{page:04d}.jp2
+    ?format=large&miadt=37&miahd={miahd}&mivast=37&rdt={rdt}&open={token}
+```
+Note the unusual filename convention: the file is named after the
+inventarisnummer (``{invnr}-{page:04d}.jp2``), not a fixed archive
+identifier.  The path itself also contains ``{invnr}`` between the code and
+filename.  ``?format=large`` returns a 1024-pixel-tall PNG (~500 KB/page);
+the full-resolution JP2 is only reachable via IIPSrv tile-server requests
+that would require an extra viewer load per page (~tens of thousands of
+extra requests project-wide), so ``format=large`` is the practical maximum
+here.
+**Caches**:
+- ``scans/gelderland/inventory_{code}.json`` – discovered leaf invnrs for one
+  kantoor: ``[{invnr, text, minr, hasScan}, …]``
+- ``scans/gelderland/tokens_{code}.json`` – per-page tokens for one kantoor
+- ``scans/gelderland/tokens_{code}_partial.json`` – incremental save written
+  every 25 invnrs so a crash mid-harvest doesn't lose work
+**Resume**: ``scans/gelderland/done.txt`` tracks completed kantoor codes.
+**Smoke test** (2026-05-11): Borculo (code 0022) end-to-end – 49 digitized
+invnrs discovered, invnr 1 ("1 1818 eerste halfjaar") → 38 pages, full-size
+download = 1024×858 PNG (~570 KB).  Tafel-only kantoor sections are
+automatically skipped at the Register-IV selection step.

memories_crawl-0.2.0/GUIDE.md ADDED

@@ -0,0 +1,64 @@
+# Guide: How this project downloads Memories van Successie
+## What are we downloading?
+When someone died in the Netherlands between 1806 and 1927, their heirs had to register the estate with the local tax office. These registers — *Memories van Successie* — list the deceased's name, date and place of death, their heirs, and what the estate was worth. They are a unique source for family history.
+About 150,000 register volumes survive across ten regional archives. Each volume contains anything from a few dozen to over a thousand handwritten pages. This project downloads all of them — scans and metadata — so researchers can work with them offline.
+## Why not just browse the websites?
+The scans exist online. But every archive uses a different website, a different viewer, and a different way of organising its data. There is no "download all" button. Some archives let you view one page at a time through a clunky in-browser viewer. Others hide the images behind JavaScript that only runs when you click through their interface. None of them expose a simple list of links you can hand to a download tool.
+That is what these scripts do: they automate the clicking, the waiting, the page-stepping, and the URL-collecting — tasks a human could do by hand, but which would take months of repetitive work.
+## The two kinds of archive software
+Most Dutch archives run one of two commercial platforms to serve their scans online.
+**MAIS** (by De Ree) is a tree-based inventory viewer. You expand folders in a hierarchy — archive → section → register → page — and a thumbnail strip loads at the bottom. The full-size images are protected by per-page tokens that expire. The browser gets these tokens from JavaScript code that runs when you click on a thumbnail; you cannot simply copy a URL and come back later. Six of the ten archives use MAIS. Because MAIS is configurable, each archive's tree has a slightly different shape, so each needs its own script — but the underlying trick (launch a headless browser, simulate clicks, harvest the URLs that appear) is the same.
+**Memorix** (by Picturae) is a REST API behind a search portal. You send a query (e.g. "give me all registers labelled *memorie van successie*") and get back structured data — metadata, file URLs, people linked to deeds. This is much easier to work with than MAIS because the data is machine-readable from the start. Three of the ten archives (Drenthe, BHIC, Tresoar) expose their collections through Memorix.
+The Nationaal Archief uses neither system and has its own custom viewer.
+## Why three Memorix pipelines look different
+Although Drenthe, BHIC, and Tresoar all run Memorix, they attach scans at different levels of the data model:
+- **Drenthe** puts scans on individual *deeds* (one entry in a register). Each deceased person gets their own folder.
+- **BHIC** puts scans on the *register* (the bound book). All pages of the book download into one folder, and a separate `deeds.json` sidecar lists every entry inside it with names and dates.
+- **Tresoar** (Friesland) also puts scans at the deed level, but the deeds are linked to *persons*, so the pipeline creates one folder per person. The images are JPEG 2000 files rather than standard JPEGs.
+These differences exist because each archive chose its own digitisation workflow — some scanned whole books, others scanned individual entries, and the metadata linking was done differently each time.
+## The Playwright part
+For the six MAIS archives, the scripts use a tool called Playwright. It launches a real Chromium browser (the same engine inside Google Chrome) but invisibly, without a window. The script tells this browser: "go to this page, wait for the tree to load, click every expand button, then collect every image URL you see." The gathered URLs — each containing a fresh authentication token — are saved to disk so the slow browser step only runs once. After that, a standard downloader fetches all the images.
+## What you end up with
+```
+scans/
+├── friesland/Sneek/1234/Pieter_Janssen_abc123/
+│   ├── metadata.json
+│   └── 0001.jp2 … 0024.jp2
+├── bhic/Den_Bosch/deel_5678/
+│   ├── metadata.json
+│   ├── deeds.json
+│   └── DenBosch_044_0001.jpg …
+├── overijssel/Zwolle/9012/
+│   ├── metadata.json
+│   └── 0000.jpg … 0127.jpg
+…
+```
+Every folder gets a `metadata.json` sidecar with the archive name, inventory number, kantoor (tax district), name of the deceased (where available), and the original web URL. From there you can browse, search, or feed the collection into other tools.
+## What this project does not do
+It does not transcribe handwriting, index names, or turn the scans into searchable text. It only downloads what the archives already published online — just in a form you can actually work with.
+---
+The scripts live in `src/memories_crawl/`. Each file covers one archive. Run `memories-crawl all` to download everything, or pick individual archives with e.g. `memories-crawl bhic`. See `README.md` for the full command list and setup instructions.