PyPI - structflo-cser - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

structflo-cser 0.2.0.tar.gz → 0.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (97)

{structflo_cser-0.2.0 → structflo_cser-0.3.0}/.gitignore RENAMED

@@ -44,6 +44,9 @@ archive/
 # ── Project: datasets & generated data ───────────────────────────────────────
 /data/
+# ── Unpublished paper drafts (confidential — keep off git, code only) ────────
+docs/*draft*.md
 # ── Model weights ─────────────────────────────────────────────────────────────
 # Weights are published to HF Hub — never commit .pt files directly.
 # Exception: commit weights.py (the registry) but not the binaries.

structflo_cser-0.3.0/CLAUDE.md ADDED

@@ -0,0 +1,210 @@
+# CLAUDE.md — structflo-cser
+## What this project does
+Chemical Structure-Label pair Extraction and Recognition (CSER) from scientific document pages.
+Given a PDF or image of a chemistry paper/patent, the pipeline detects chemical structure drawings
+and their compound labels (e.g. "CHEMBL12345", "Compound 1a"), pairs them, then extracts SMILES
+strings and OCR text.
+## Package name & layout
+- **PyPI package**: `structflo-cser`
+- **Top-level packages** (wheel): `structflo`, `annotate`
+- **Source root**: `structflo/cser/` — all library code lives here
+- **Annotate tool**: `annotate/` — Flask web app for manual bbox annotation
+### Module map
+```
+structflo/cser/
+  config.py           PageConfig dataclass (A4@300DPI defaults, slide layouts)
+  _geometry.py         Pure bbox utilities (clamp, intersect, placement)
+  weights.py           HF Hub weight registry + auto-download (resolve_weights)
+  __init__.py          Package version
+  data/
+    smiles.py          Fetch/load SMILES from ChEMBL CSV
+    distractor_images.py  Download/load distractor images for training data
+  rendering/
+    chemistry.py       RDKit 2D structure rendering to PIL images
+    text.py            Label text rendering (random compound IDs, fonts, rotation)
+  distractors/
+    charts.py          Synthetic chart/figure distractors
+    shapes.py          Geometric shapes for hard negatives
+    text_elements.py   Prose blocks, captions, stray text
+  generation/
+    page.py            Core page compositor (place structures + labels + distractors)
+    dataset.py         Dataset generation orchestrator (multiprocessing, YOLO label export)
+    specialty.py       Specialty layouts (SAR tables, MMP sheets, data cards)
+    tabular.py         Excel-style and grid compound layouts
+  training/
+    trainer.py         YOLO11l training wrapper (AdamW, cosine LR, grayscale augmentation)
+  inference/
+    detector.py        YOLO inference (tiled + full-image), visualisation
+    tiling.py          Sliding-window tile generation
+    nms.py             Greedy NMS for merging tiled detections
+    pairing.py         Hungarian matching on centroid distance
+  lps/                 Learned Pair Scorer (replaces Euclidean matching)
+    features.py        14-dim geometric features + visual crop extraction
+    scorer.py          PairScorer CNN (~557K params): struct_crop + label_crop + geom → logit
+    matcher.py         LearnedMatcher (BaseMatcher impl using PairScorer + Hungarian)
+    dataset.py         LPS training dataset (positive/negative pair sampling from GT)
+    train.py           LPS training loop
+    evaluate.py        LPS evaluation script
+  pipeline/
+    models.py          Core dataclasses: BBox, Detection, CompoundPair
+    matcher.py         BaseMatcher ABC + HungarianMatcher
+    ocr.py             BaseOCR ABC + EasyOCRExtractor + NullOCR
+    smiles_extractor.py  BaseSmilesExtractor ABC + DecimerExtractor + NullSmilesExtractor
+    pipeline.py        ChemPipeline: detect → match → enrich (main public API)
+    cli.py             sf-extract CLI entry point
+  viz/
+    labels.py          Visualise YOLO label files on synthetic pages
+    detections.py      Matplotlib plots for Detection/CompoundPair objects
+annotate/
+  __main__.py          Flask annotation tool entry point
+  server.py            Flask routes
+  pdf.py               PDF page rendering for annotation
+  storage.py           Annotation JSON storage
+  templates/           HTML templates
+```
+## CLI entry points (registered in pyproject.toml)
+| Command                   | Module                                    | Purpose                              |
+|---------------------------|-------------------------------------------|--------------------------------------|
+| `sf-generate`             | `structflo.cser.generation.dataset:main`  | Generate synthetic training data     |
+| `sf-train`                | `structflo.cser.training.trainer:main`    | Train YOLO11l detector               |
+| `sf-detect`               | `structflo.cser.inference.detector:main`  | Run detection on images              |
+| `sf-extract`              | `structflo.cser.pipeline.cli:main`        | Full pipeline: detect+match+extract  |
+| `sf-viz`                  | `structflo.cser.viz.labels:main`          | Visualise YOLO labels on images      |
+| `sf-fetch-smiles`         | `structflo.cser.data.smiles:main`         | Download SMILES from ChEMBL          |
+| `sf-download-distractors` | `structflo.cser.data.distractor_images:main` | Download distractor images        |
+| `sf-annotate`             | `annotate.__main__:main`                  | Manual annotation web tool           |
+| `sf-train-lps`            | `structflo.cser.lps.train:main`           | Train Learned Pair Scorer            |
+| `sf-eval-lps`             | `structflo.cser.lps.evaluate:main`        | Evaluate LPS model                   |
+## Key public API
+```python
+from structflo.cser.pipeline import ChemPipeline
+pipeline = ChemPipeline()                    # auto-downloads weights
+pairs = pipeline.process("page.png")         # detect → match → enrich
+pairs = pipeline.process_pdf("paper.pdf")    # per-page processing
+# Low-level access
+detections = pipeline.detect(image)
+pairs = pipeline.match(detections, image=image)
+pairs = pipeline.enrich(pairs, image)
+# Output
+ChemPipeline.to_json(pairs)
+ChemPipeline.to_dataframe(pairs)
+ChemPipeline.to_records(pairs)
+```
+## Detection model
+- **Architecture**: YOLO11l (ultralytics)
+- **Classes**: 2 — `chemical_structure` (0), `compound_label` (1)
+- **Training image size**: 1280px
+- **Inference**: sliding-window tiling (1536px tiles, 20% overlap) + per-class NMS
+- **Training config**: AdamW, cosine LR, grayscale images, no colour augmentation
+- **Runs directory**: `runs/labels_detect/`
+- **YOLO data config**: `config/data.yaml`
+## Matching strategies
+1. **HungarianMatcher** — centroid Euclidean distance + `scipy.optimize.linear_sum_assignment`
+2. **LearnedMatcher** (LPS) — CNN scorer produces association probability per (struct, label) pair,
+   then Hungarian on `1 - score`. Default in ChemPipeline.
+## Weights system
+Weights are versioned independently of the package and stored on HuggingFace Hub.
+`structflo.cser.weights.resolve_weights(model, version)` handles auto-download + caching.
+| Model          | HF Repo                         | Latest |
+|----------------|----------------------------------|--------|
+| cser-detector  | sidxz/structflo-cser-detector   | v0.2   |
+| cser-lps       | sidxz/structflo-cser-lps        | v0.1   |
+Publish script: `scripts/publish_weights.py`
+## Fine-tuning on real data
+Scripts live in `scripts/finetune/{yolo,lps}/`, each with `prepare_data.py`, `train.sh`, `eval_compare.py`.
+### Data layout
+- **Real annotations**: produced by `sf-annotate`, stored externally (symlinked in)
+- **Combined data**: `data/finetune/{yolo,lps}/` — symlinks mixing subsampled synthetic + oversampled real
+- Knobs at top of each `prepare_data.py`: `N_SYNTH_TRAIN`, `N_SYNTH_VAL`, `REAL_OVERSAMPLE`, `N_REAL_VAL`
+### YOLO fine-tune
+- Starts from `runs/labels_detect/yolo11l_panels/weights/best.pt`
+- Output: `runs/labels_detect/finetune_trial/weights/best.pt`
+- Lower LR (1e-4), short warmup (1 epoch), 10 epochs default
+### LPS fine-tune
+- Uses `sf-train-lps --finetune <checkpoint>` (loads weights only, fresh optimizer/scheduler)
+- Distinct from `--resume` which restores full training state (optimizer, scheduler, epoch)
+- Starts from `runs/lps/best.pt`, output: `runs/lps_finetune/best.pt`
+### Eval
+- `eval_compare.py` runs both baseline and fine-tuned on two val sets (finetune val + original synthetic val)
+- Prints summary table with deltas and a verdict (improvement vs regression)
+### Publishing fine-tuned weights
+```bash
+python scripts/publish_weights.py --model cser-detector --version vX.Y \
+    --weights-file runs/labels_detect/finetune_trial/weights/best.pt
+python scripts/publish_weights.py --model cser-lps --version vX.Y \
+    --weights-file runs/lps_finetune/best.pt
+```
+## Synthetic data generation
+- Pages: A4@300DPI (2480x3508) as JPEG, also slide layouts (16:9)
+- Layout types: free-form (~30%), Excel tables (~14%), grids (~12%), SAR tables (~8%),
+  MMP sheets (~7%), data cards (~8%), slides (~13%), hard negatives (~8%)
+- Structures rendered via RDKit from ChEMBL SMILES
+- Labels: random compound IDs in various styles (CHEMBL, ZINC, Roman numerals, etc.)
+- Noise augmentation: JPEG artifacts, blur, brightness, Gaussian noise
+- Output: images + YOLO .txt labels + ground truth JSON (per-compound struct/label bboxes + SMILES)
+- Default: 2000 train / 200 val pages, multiprocessing with all CPUs
+## Build & dev
+```bash
+uv sync --dev              # install all deps
+uv run ruff check structflo/ tests/   # lint
+uv run ruff format structflo/ tests/  # format
+uv run pytest -q           # tests
+uv build                   # build wheel
+```
+- **Python**: >=3.11 (project uses 3.12)
+- **Build system**: hatchling + hatch-vcs (version from git tags)
+- **Linting**: ruff
+- **Tests**: pytest (tests/ directory)
+- **CI**: GitHub Actions — lint + format check + pytest + coverage on push/PR to main
+- **PyPI publish**: on git tag `v*`
+## Conventions
+- All images converted to grayscale before detection (matches training distribution)
+- Adapters pattern: `BaseMatcher`, `BaseOCR`, `BaseSmilesExtractor` ABCs for swappable components
+- Lazy model loading throughout (YOLO, EasyOCR, DECIMER loaded on first use)
+- Weights never committed to git (*.pt in .gitignore), only on HF Hub
+- `runs/`, `data/`, `detections/`, `archive/` are gitignored
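
The CLAUDE.md above describes inference as sliding-window tiling with 1536px tiles and 20% overlap, but `tiling.py` itself is not shown in this view. As a rough sketch of how such tile coordinates could be produced: the names `axis_starts` and `make_tiles` are hypothetical, and pinning the last tile flush against the far edge is an assumption, not something the diff confirms.

```python
def axis_starts(size: int, tile: int, stride: int) -> list[int]:
    """Start offsets along one axis so that tiles cover [0, size)."""
    if size <= tile:
        return [0]  # the image fits in a single tile along this axis
    starts = list(range(0, size - tile, stride))
    # Assumption: the final tile is aligned with the far edge instead of
    # being padded, so its overlap with the previous tile may exceed 20%.
    starts.append(size - tile)
    return starts


def make_tiles(width: int, height: int, tile: int = 1536,
               overlap: float = 0.20) -> list[tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) tile boxes in row-major order."""
    stride = int(tile * (1 - overlap))  # 1228 px for the documented defaults
    tiles = []
    for y in axis_starts(height, tile, stride):
        for x in axis_starts(width, tile, stride):
            tiles.append((x, y, min(x + tile, width), min(y + tile, height)))
    return tiles


# A4 at 300 DPI, the page size quoted in the synthetic-data section.
tiles = make_tiles(2480, 3508)
print(len(tiles), tiles[0], tiles[-1])
```

Under these assumptions an A4 page is covered by six tiles. Detections from each tile would then need to be shifted by that tile's `(x1, y1)` offset before the per-class NMS step merges them.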

{structflo_cser-0.2.0 → structflo_cser-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: structflo-cser
-Version: 0.2.0
+Version: 0.3.0
 Summary: Chemical structure-label pair extraction from scientific documents.
 Requires-Python: >=3.11
 Requires-Dist: chembl-webresource-client>=0.10.9

{structflo_cser-0.2.0 → structflo_cser-0.3.0}/annotate/storage.py RENAMED

@@ -12,7 +12,8 @@ Ground-truth JSON schema (pair format):
     ]
 YOLO .txt (written only when pairs are non-empty):
-    0  cx  cy  w  h   (normalised 0-1; class 0 = compound_panel = union bbox)
+    0  cx  cy  w  h   (normalised 0-1; class 0 = chemical_structure)
+    1  cx  cy  w  h   (normalised 0-1; class 1 = compound_label)
 Annotation states:
     - GT JSON absent  → page not yet visited
@@ -47,7 +48,7 @@ def save(page_id: str, pairs: list[dict], img_w: int, img_h: int,
     GT JSON is *always* written (even for empty pages) so the page is
     tracked as 'done'.  YOLO .txt is only written when pairs are present.
-    YOLO bounding box = union of struct_bbox and label_bbox (class 0).
+    YOLO labels: class 0 = chemical_structure, class 1 = compound_label.
     """
     gt_dir = output_dir / "ground_truth"
     gt_dir.mkdir(parents=True, exist_ok=True)
@@ -75,14 +76,17 @@ def save(page_id: str, pairs: list[dict], img_w: int, img_h: int,
             s = pair["struct_bbox"]          # [x1, y1, x2, y2]
             l = pair.get("label_bbox")       # [x1, y1, x2, y2] or None
+            # class 0 = chemical_structure
+            sx1, sy1, sx2, sy2 = s
+            f.write(
+                f"0 {(sx1 + sx2) / 2 / img_w:.6f} {(sy1 + sy2) / 2 / img_h:.6f} "
+                f"{(sx2 - sx1) / img_w:.6f} {(sy2 - sy1) / img_h:.6f}\n"
+            )
+            # class 1 = compound_label
             if l:
-                x1 = min(s[0], l[0]);  y1 = min(s[1], l[1])
-                x2 = max(s[2], l[2]);  y2 = max(s[3], l[3])
-            else:
-                x1, y1, x2, y2 = s
-            cx = (x1 + x2) / 2 / img_w
-            cy = (y1 + y2) / 2 / img_h
-            w  = (x2 - x1) / img_w
-            h  = (y2 - y1) / img_h
-            f.write(f"0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}\n")
+                lx1, ly1, lx2, ly2 = l
+                f.write(
+                    f"1 {(lx1 + lx2) / 2 / img_w:.6f} {(ly1 + ly2) / 2 / img_h:.6f} "
+                    f"{(lx2 - lx1) / img_w:.6f} {(ly2 - ly1) / img_h:.6f}\n"
+                )
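
The hunk above switches the writer from one union box per pair (class 0) to one line per box: class 0 for the structure and class 1 for the label when present. The pixel-to-normalised conversion it repeats inline can be isolated as in the sketch below. `yolo_line` and `pair_to_yolo_lines` are hypothetical helper names used only for illustration; they do not appear in `annotate/storage.py`.

```python
def yolo_line(class_id: int, bbox: list[float], img_w: int, img_h: int) -> str:
    """Format a pixel [x1, y1, x2, y2] box as a YOLO 'cls cx cy w h' line."""
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2 / img_w   # box centre, normalised to 0-1
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # box size, normalised to 0-1
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def pair_to_yolo_lines(pair: dict, img_w: int, img_h: int) -> list[str]:
    """Class 0 = chemical_structure, class 1 = compound_label (if any)."""
    lines = [yolo_line(0, pair["struct_bbox"], img_w, img_h)]
    label_bbox = pair.get("label_bbox")  # may be None for unlabelled structures
    if label_bbox:
        lines.append(yolo_line(1, label_bbox, img_w, img_h))
    return lines


example = {"struct_bbox": [100, 200, 300, 600], "label_bbox": None}
print(pair_to_yolo_lines(example, img_w=1000, img_h=2000))
```

A pair without a label yields a single class-0 line, which matches the behaviour of the new code path where the `if l:` branch is skipped.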

structflo_cser-0.3.0/docs/fine-tune.md ADDED

@@ -0,0 +1,115 @@
+  # YOLO
+  uv run python scripts/finetune/yolo/prepare_data.py
+  bash scripts/finetune/yolo/train.sh
+  uv run python scripts/finetune/yolo/eval_compare.py
+  # LPS
+  uv run python scripts/finetune/lps/prepare_data.py
+  bash scripts/finetune/lps/train.sh
+  uv run python scripts/finetune/lps/eval_compare.py
+  Eval now checks two things per model:
+  1. Finetune val (50 synth + 2 real) — did real data help?
+  2. Original synthetic val (2000 pages) — did fine-tuning regress?
+  To publish if results look good:
+  python scripts/publish_weights.py --model cser-detector --version v0.3 \
+      --weights-file runs/labels_detect/finetune_trial/weights/best.pt
+  python scripts/publish_weights.py --model cser-lps --version v0.2 \
+      --weights-file runs/lps_finetune/best.pt
+ Based on the current defaults in scripts/finetune/*/prepare_data.py, here's what to scale up:
+  ┌─────────────────┬─────────────┬────────────────────────────┬────────────────────────────────────────────────────┐
+  │      Param      │ Trial (now) │ Full dataset (~100+ pages) │                        Why                         │
+  ├─────────────────┼─────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
+  │ N_SYNTH_TRAIN   │ 200         │ 2000–5000                  │ More synthetic to prevent forgetting               │
+  ├─────────────────┼─────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
+  │ N_SYNTH_VAL     │ 50          │ 200–500                    │ More reliable val metrics                          │
+  ├─────────────────┼─────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
+  │ REAL_OVERSAMPLE │ 10          │ 3–5                        │ Less oversampling needed since you have more pages │
+  ├─────────────────┼─────────────┼────────────────────────────┼────────────────────────────────────────────────────┤
+  │ N_REAL_VAL      │ 2           │ 10–20% of total            │ Meaningful real val set                            │
+  └─────────────────┴─────────────┴────────────────────────────┴────────────────────────────────────────────────────┘
+  The goal is to keep real data at roughly 30–50% of training. With 100 real pages at 3x oversample + 2000 synthetic, that's 300/(300+2000) = ~13%. Bumping oversample to 5 gives 500/(500+2000) = 20%;
+  also lowering N_SYNTH_TRAIN to 1000 gives 500/(500+1000) = ~33%.
+  For training hyperparams in train.sh:
+  - Epochs: trial uses 10, full run can stay at 10–15 (more data per epoch means less risk of overfitting)
+  - LR: keep as-is (1e-4 YOLO, 3e-4 LPS) — these are already conservative
+With 500 annotated pages, here are the recommended values and what each param does:
+  prepare_data.py params
+  N_SYNTH_TRAIN (trial: 200 → recommended: 3000–5000)
+  Number of synthetic images randomly sampled into the training set. Synthetic data acts as a regularizer — it prevents the model from overfitting to quirks of your specific papers (particular fonts, DPI,
+   layout style). Too few and the model forgets synthetic-learned features; too many and real data gets drowned out. With 500 real pages, 3000–5000 synthetic keeps the ratio healthy.
+  N_SYNTH_VAL (trial: 50 → recommended: 300–500)
+  Synthetic images in the validation set. More gives you stable metrics — with only 50, a single noisy page can swing mAP by several points. 300+ makes the regression check trustworthy.
+  REAL_OVERSAMPLE (trial: 10 → recommended: 2–3)
+  Each real image gets this many symlink copies with unique names, so YOLO/LPS treats them as separate training examples. With only 14 pages you needed 10x to make real data visible. With 500 pages, 2–3x
+  is enough. At 3x: 1500 real / (1500 + 4000 synth) = 27% of training. Too high and the model memorizes your annotation set instead of generalizing.
+  N_REAL_VAL (trial: 2 → recommended: 50–75)
+  Real pages held out for validation (never seen during training). This is your ground truth for "did fine-tuning actually help on real documents?". At 50 pages you get reliable per-class metrics. The
+  remaining 425–450 go to training.
+  train.sh params (YOLO)
+  epochs=10 → recommended: 10–15
+  Number of full passes over the training data. More epochs = more chances to learn, but with 500 real pages you have enough data per epoch that 10–15 is sufficient. Early stopping (patience=5) will halt if val
+  metrics plateau anyway.
+  lr0=1e-4 → keep as-is
+  Starting learning rate, 10x lower than from-scratch training (1e-3). Low LR is critical for fine-tuning — too high and you destroy the features learned from 20K synthetic pages. Too low and you never
+  adapt. 1e-4 is a standard fine-tune rate.
+  warmup_epochs=1 → keep as-is
+  Epochs where LR ramps from ~0 to lr0. Prevents large early gradient updates that could destabilize the pretrained weights. 1 is enough since we're already starting with a well-trained model.
+  patience=5 → keep as-is
+  Stop training if val mAP doesn't improve for this many epochs. Prevents overfitting if the model converges early.
+  train.sh params (LPS)
+  --lr 3e-4 → keep as-is
+  Same logic — lower than from-scratch (1e-3) but the LPS model is tiny (557K params) so it can tolerate a slightly higher fine-tune LR than YOLO.
+  --epochs 10 → recommended: 10–15
+  --batch 512 → keep as-is
+  Batch size. LPS samples are small (crops + 14-dim features), so 512 fits easily in memory and gives stable gradient estimates.
+  Concrete config for 500 pages
+  # prepare_data.py (both yolo and lps)
+  N_SYNTH_TRAIN = 4000
+  N_SYNTH_VAL = 400
+  N_REAL_VAL = 50
+  REAL_OVERSAMPLE = 3
+  That gives you:
+  - Train: 4000 synth + 450×3 = 5350 images (25% real)
+  - Val: 400 synth + 50 real = 450 images
+  Training hyperparams stay the same — they're already tuned for fine-tuning.
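
The train/val arithmetic in docs/fine-tune.md can be checked with a few lines of Python. This is only a sketch of the bookkeeping the document describes: `training_mix` is a hypothetical helper, not a function from `prepare_data.py`, and it assumes held-out real pages are removed before oversampling.

```python
def training_mix(n_synth_train: int, n_real_total: int,
                 n_real_val: int, real_oversample: int) -> dict:
    """Size and real-data share of the combined fine-tune training set."""
    n_real_train = n_real_total - n_real_val       # real pages left for training
    real_copies = n_real_train * real_oversample   # each page appears this many times
    total = n_synth_train + real_copies
    return {
        "real_copies": real_copies,
        "total": total,
        "real_fraction": real_copies / total,
    }


# Concrete config for 500 annotated pages, as listed in the document.
mix = training_mix(n_synth_train=4000, n_real_total=500,
                   n_real_val=50, real_oversample=3)
print(mix["total"], f"{mix['real_fraction']:.0%}")
```

With the documented values this gives 5350 training images, about 25% of them real. The same helper reproduces the earlier 100-page figure: `training_mix(2000, 100, 0, 3)` gives a real share of about 13%.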
