npboxdetect 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- npboxdetect-0.1.0/.gitignore +1 -0
- npboxdetect-0.1.0/JOURNEY.md +282 -0
- npboxdetect-0.1.0/JOURNEY2.md +136 -0
- npboxdetect-0.1.0/PKG-INFO +83 -0
- npboxdetect-0.1.0/README.md +66 -0
- npboxdetect-0.1.0/benchmark.py +141 -0
- npboxdetect-0.1.0/boxdetect/__init__.py +3 -0
- npboxdetect-0.1.0/boxdetect/config.py +264 -0
- npboxdetect-0.1.0/boxdetect/img_proc.py +274 -0
- npboxdetect-0.1.0/boxdetect/pipelines.py +287 -0
- npboxdetect-0.1.0/boxdetect/rect_proc.py +349 -0
- npboxdetect-0.1.0/data/lc_application1.png +0 -0
- npboxdetect-0.1.0/data/lc_application2.png +0 -0
- npboxdetect-0.1.0/npboxdetect/__init__.py +10 -0
- npboxdetect-0.1.0/npboxdetect/_numba_ops.py +189 -0
- npboxdetect-0.1.0/npboxdetect/detector.py +232 -0
- npboxdetect-0.1.0/profile_run.py +46 -0
- npboxdetect-0.1.0/pyproject.toml +29 -0
- npboxdetect-0.1.0/results/lc_application1__boxdetect.png +0 -0
- npboxdetect-0.1.0/results/lc_application1__morphology.png +0 -0
- npboxdetect-0.1.0/results/lc_application1__npboxdetect.png +0 -0
- npboxdetect-0.1.0/results/lc_application1__opencv_contours.png +0 -0
- npboxdetect-0.1.0/results/lc_application2__boxdetect.png +0 -0
- npboxdetect-0.1.0/results/lc_application2__morphology.png +0 -0
- npboxdetect-0.1.0/results/lc_application2__npboxdetect.png +0 -0
- npboxdetect-0.1.0/results/lc_application2__opencv_contours.png +0 -0
- npboxdetect-0.1.0/results/lc_application__boxdetect.png +0 -0
- npboxdetect-0.1.0/results/lc_application__morphology.png +0 -0
- npboxdetect-0.1.0/results/lc_application__opencv_contours.png +0 -0
@@ -0,0 +1 @@
__pycache__/
@@ -0,0 +1,282 @@

# npboxdetect: The Optimization Journey (v0 → v6)

> From 4248ms to 37ms — matching boxdetect (OpenCV) using pure NumPy + SciPy.
> Same image. Same 67 detected boxes. Zero deep learning.

---

## The Goal

**boxdetect** is an OpenCV-based checkbox detector that takes an image and returns bounding boxes `(x, y, w, h)`. It runs in ~35ms on a 2200×1700 image.

**npboxdetect** is our attempt to match or beat it using only NumPy (and minimal SciPy for labeled connected components). No OpenCV in the hot path.

---

## The Pipeline (What We're Implementing)

boxdetect's pipeline has these stages:

1. Load image → grayscale
2. Double threshold (Otsu + mean) → binary image
3. Dilate (inflate borders)
4. Morphological OPEN with line kernels → highlight rectangular shapes
5. Find contours
6. Filter by size + ratio
7. Merge overlapping boxes
8. Group into rows/columns

We mirror this pipeline in npboxdetect, step by step.

---

## v0 — First Working Version: 4248ms

### What we built

A pure-Python + NumPy implementation of the full pipeline. Every step was written from scratch without OpenCV (SciPy appears only in the morphological OPEN).

### Step-by-step

#### Step 1: Load image

**Package:** `PIL (Pillow)` — `Image.open(path).convert("L")`

**Why:** Pillow is the standard Python image loader. `.convert("L")` gives grayscale directly.

**Cost:** ~28ms — Pillow decodes the image in Python, which is slow for large files.

#### Step 2: Otsu Threshold

**Package:** Pure NumPy — `np.histogram` + a Python for-loop over 256 bins

**Why:** Otsu's method finds the best threshold by maximizing between-class variance. We implemented it manually.

**What it does:** Builds a 256-bin histogram, then loops over all possible thresholds (0–255), computing the variance between foreground and background. Returns the threshold with maximum between-class variance.

**Cost:** ~35ms — `np.histogram` is slow (27ms alone). The Python loop over 256 bins adds overhead.
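The manual Otsu step described above can be sketched as follows. This is a hypothetical reconstruction, not the package's actual code; `otsu_threshold` is an illustrative name:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold for a uint8 grayscale array."""
    hist = np.histogram(gray.ravel(), bins=256, range=(0, 256))[0].astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)  # total intensity mass
    best_t, best_var = 0, -1.0
    w_bg = 0.0    # background pixel count so far
    sum_bg = 0.0  # background intensity sum so far
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mu_bg = sum_bg / w_bg
        mu_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal toy "image": the threshold lands on the lower mode.
t = otsu_threshold(np.concatenate([np.full(500, 40, np.uint8),
                                   np.full(500, 200, np.uint8)]))
print(t)  # 40
```

The 256-iteration loop itself is cheap; the 27ms `np.histogram` call is what v4 later attacks.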

#### Step 3: Apply thresholding

**Why:** boxdetect does TWO inversions — Otsu-inverted AND mean-inverted — then ORs them. This makes box borders extra bright.

**Cost:** ~2ms

#### Step 4: Morphological OPEN with line kernels

**Package:** `scipy.ndimage.binary_opening`

**Why:** Morph OPEN = erode then dilate. It keeps only structures that match the kernel shape. We used two line kernels (horizontal + vertical) to detect box borders.

**What it does:** A pixel survives a horizontal line kernel only if it sits inside a horizontal run of foreground pixels at least that long; likewise for vertical. OR-ing the two results highlights rectangular outlines.

**Cost:** ~219ms — SciPy's `binary_opening` runs the full 2D erosion+dilation on a 2200×1700 array, twice (horizontal + vertical kernels). Slow.

#### Step 5: Connected Components — THE KILLER

**Package:** Pure Python — hand-written two-pass union-find labeling

**Why:** We wanted "pure numpy" so we wrote our own CC labeling.

**What it does:** Scans every pixel, assigns labels, merges adjacent labels via union-find.

**Cost:** **1602ms** — This is a pixel-by-pixel Python loop over 3.7 million pixels. Python loops are ~100x slower than C. This single step destroyed performance.

#### Step 6: Extract bounding boxes from labels

**Package:** Pure Python — `np.where(labels == lbl)` for each unique label

**Why:** For each connected component label, find min/max x and y.

**Cost:** **2322ms** — Called `np.where` once per label (306 labels). Each `np.where` scans the full array. 306 × full-array scan = catastrophic.

#### Step 7: Filter by size + ratio

Pure Python list comprehension. 0.04ms — negligible.

#### Step 8: NMS (non-max suppression)

Custom numpy IoU-based deduplication. ~11ms.

### v0 Total: **4248ms** | boxdetect: 35ms | **120x slower**

---

## v1 — Kill the Python Loops: 305ms

### The Fix

Replace the pure-Python connected components (steps 5+6) with SciPy's C-compiled versions.

#### Step 5+6: `scipy.ndimage.label` + `scipy.ndimage.find_objects`

**Package:** `scipy.ndimage`

**`scipy.ndimage.label(binary)`**
- What it does: Labels every connected blob of white pixels with a unique integer. Returns a 2D label array and the count of blobs. Written in C, it processes the entire array in one pass.
- Why we chose it: It's the standard, battle-tested CC algorithm in Python's scientific stack. 10–50x faster than any pure-Python implementation.
- Cost after: ~24ms (was 1602ms)

**`scipy.ndimage.find_objects(labeled)`**
- What it does: Returns a list of `(slice_y, slice_x)` tuples — one per label — representing the tight bounding box of each blob. No per-label array scan. One C pass over the label array.
- Why we chose it: Instead of calling `np.where(labels == i)` 306 times (each scanning 3.7M pixels), `find_objects` gives all bounding slices in a single pass.
- Cost after: included in the 24ms above (was 2322ms)

**Also fixed:** Replaced the `scipy.ndimage.binary_opening` morph open with our own 1D cumsum approach (still slow at this point, but it laid the groundwork).

### v1 Total: **305ms** | **14x improvement over v0**
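The two calls can be seen end to end on a toy image. A minimal sketch assuming SciPy is available; the slice-to-box conversion loop is illustrative:

```python
import numpy as np
from scipy import ndimage

# Two white blobs on a black background.
binary = np.zeros((10, 10), dtype=np.uint8)
binary[1:4, 1:4] = 1   # blob 1: rows 1-3, cols 1-3
binary[6:9, 5:9] = 1   # blob 2: rows 6-8, cols 5-8

labels, n = ndimage.label(binary)      # one C pass: label every blob
slices = ndimage.find_objects(labels)  # one C pass: bounding slice per label

boxes = []
for sy, sx in slices:  # convert (slice_y, slice_x) to (x, y, w, h)
    boxes.append((sx.start, sy.start, sx.stop - sx.start, sy.stop - sy.start))

print(n)      # 2
print(boxes)  # [(1, 1, 3, 3), (5, 6, 4, 3)]
```

Whatever the blob count, the cost stays two array passes instead of one full scan per label.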

---

## v2 — Vectorize Morph OPEN: 181ms

### The Problem

`scipy.ndimage.binary_opening` with line kernels was taking 184ms. It runs a full 2D erosion+dilation on a 3.7M-pixel array. Way too expensive.

### The Fix

Replace `binary_opening` with a custom **1D separable erosion+dilation using `np.lib.stride_tricks`**.

#### Erosion via stride tricks

**Package:** `numpy.lib.stride_tricks.as_strided`

**What it does:** Creates a sliding-window view of the array without copying data. A window of size K over width W creates an `(H, W-K+1, K)` view. Then `windows.min(axis=2)` gives the erosion result — the minimum over each window.

**Why:** Instead of looping over each pixel, we get all windows at once as a 3D view and reduce in one numpy operation.

**Cost after:** Still slow (~90ms per call) because the `as_strided` view is not contiguous — numpy has to gather K elements from memory for each window, which kills cache performance at large K.

### v2 Total: **181ms** | **1.7x improvement over v1**
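Modern NumPy wraps `as_strided` in the safer `numpy.lib.stride_tricks.sliding_window_view`, which builds the same zero-copy `(H, W-K+1, K)` view. A minimal erosion sketch with illustrative names, not the package's code:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def erode_rows(b: np.ndarray, k: int) -> np.ndarray:
    """1D horizontal erosion of a binary array via a zero-copy window view."""
    # (H, W) -> (H, W-k+1, k): no data is copied, only strides change.
    windows = sliding_window_view(b, window_shape=k, axis=1)
    # A position survives only if its whole k-wide window is 1.
    return windows.min(axis=2)

b = np.array([[0, 1, 1, 1, 0, 1]], dtype=np.uint8)
print(erode_rows(b, 3))  # [[0 1 0 0]]
```

Note the output is `W-k+1` wide (one value per window start); the `min` reduction over the non-contiguous window axis is exactly the cache-unfriendly gather the text describes.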

---

## v3 — Fix the Window Slice: 128ms

### The Problem

`stride_tricks` was slow because of non-contiguous memory access. The real fix: use a **cumulative sum (prefix sum)** to compute any window sum in O(1) per window.

### The Fix

**Package:** Pure NumPy — `np.cumsum` + contiguous slice subtraction

**How prefix-sum erosion works:**

```
cumsum[i]     = sum of all pixels from 0 to i
window_sum[j] = cumsum[j+length] - cumsum[j]
eroded[j]     = 1 if window_sum[j] == length  (all pixels are 1)
dilated[j]    = 1 if window_sum[j] > 0        (any pixel is 1)
```

**Key insight:** `cs[:, length:] - cs[:, :valid_w]` — both are contiguous slices, so numpy subtracts them in a single vectorized pass. No index arrays, no copies.

**Why this beats stride_tricks:** cumsum is one sequential pass (cache-friendly). The subtraction is element-wise on contiguous memory. Total: 2 cumsums + 1 subtraction per erode/dilate instead of building a giant 3D view.

### v3 Total: **128ms** | **1.4x improvement over v2**
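A runnable sketch of the prefix-sum opening. The left-padding used for the dilation pass is a choice of this sketch (it keeps both subtractions on contiguous slices), not necessarily what the package does:

```python
import numpy as np

def open_rows(b: np.ndarray, length: int) -> np.ndarray:
    """Horizontal morphological opening via prefix sums (erode, then dilate)."""
    H, W = b.shape
    # Prefix sums with a leading zero column, so window sums are pure slice math.
    cs = np.zeros((H, W + 1), dtype=np.int32)
    np.cumsum(b, axis=1, out=cs[:, 1:])
    # Sum of every length-wide window, left-anchored: one contiguous subtract.
    win = cs[:, length:] - cs[:, :W - length + 1]
    eroded = np.zeros_like(b)
    eroded[:, :W - length + 1] = (win == length)  # window entirely 1 -> hit

    # Dilate: a pixel turns on if any erosion hit starts within `length` to its
    # left. Left-pad so this is again a contiguous window-sum subtraction.
    pad = np.zeros((H, W + length - 1), dtype=b.dtype)
    pad[:, length - 1:] = eroded
    cs2 = np.zeros((H, W + length), dtype=np.int32)
    np.cumsum(pad, axis=1, out=cs2[:, 1:])
    return ((cs2[:, length:] - cs2[:, :W]) > 0).astype(np.uint8)

b = np.array([[1, 1, 1, 0, 1, 1]], dtype=np.uint8)
print(open_rows(b, 3))  # [[1 1 1 0 0 0]] — the 2-long run is removed
```

Every pass here touches contiguous memory, which is why it beats the strided 3D view.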

---

## v4 — Faster Otsu with bincount: 117ms

### The Problem

`np.histogram(gray.ravel(), bins=256, range=(0,256))` was costing 27ms just to build the pixel histogram.

### The Fix

**Package:** NumPy — `np.bincount`

**`np.bincount(gray.ravel(), minlength=256)`**
- What it does: Counts occurrences of each integer value 0–255 in the flat array.
- Why it's faster: `bincount` is purpose-built for integer counting. It makes one pass through the array with no range/bin-boundary checks. `np.histogram` has overhead for generic floating-point binning logic.
- Cost: 10ms vs 27ms — **3x faster** for the same result.

Also improved `apply_thresholding`: instead of creating two separate uint8 arrays and adding them, we do a single boolean OR:

```python
result = (gray < t_otsu) | (gray < t_mean)
```

One operation, bool dtype (1 byte per pixel vs 4), then `.view(np.uint8) * 255` to convert — avoids an intermediate allocation.

### v4 Total: **117ms** | **1.1x improvement over v3**
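The "same result" claim is easy to check: for uint8 data, 256 unit-width histogram bins and `bincount` count exactly the same thing. A quick sketch, with a random image standing in for a real document:

```python
import numpy as np

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(2200, 1700), dtype=np.uint8)

h1 = np.histogram(gray.ravel(), bins=256, range=(0, 256))[0]  # generic binning
h2 = np.bincount(gray.ravel(), minlength=256)                 # integer counting

print(np.array_equal(h1, h2))  # True — identical 256-bin histogram
```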

---

## v5 — Eliminate Redundant Transpose: 111ms

### The Problem

For the vertical morph open, we were calling `np.ascontiguousarray(b.T)` inside `_open1d_h`, which allocated a new contiguous copy of the transposed array on every call. This happened twice (once for erode, once for dilate).

### The Fix

Compute the contiguous transpose **once** outside the function and pass it in. Reuse it across both erode and dilate steps:

```python
bT = np.ascontiguousarray(b.T)     # one copy
opened_h = _open1d_h(b, h_len)
opened_v = _open1d_h(bT, v_len).T  # reuses bT
```

`.T` on the result is a **free view** (no copy) — numpy just swaps the stride metadata.

Also switched the final OR to use `.view(np.uint8)` instead of `.astype(np.uint8)` — a view reinterprets the bool memory as uint8 in place (zero copy).

### v5 Total: **111ms** | **1.06x improvement over v4**

---

## v6 — 2x Downsample Before Processing: 37ms ✅

### The Insight

All the expensive operations (threshold, morph open, labeling) scale with **image pixel count**. A 2200×1700 image has 3.74M pixels. At half size it's 1100×850 = 935K pixels — **4x fewer**.

boxdetect already does this via `scaling_factors` — it was the trick we missed all along.

### The Fix

Downsample by 2x using **stride-based slicing** before the pipeline:

```python
gray_small = gray[::2, ::2]  # 0.008ms — no copy, just a view with stride 2
```

**Why `[::2, ::2]` and not `cv2.resize`:**
- `gray[::2, ::2]` is a numpy view — it changes the stride metadata without copying any data. Cost: 0.008ms.
- `cv2.resize` or `PIL.resize` interpolates pixels and allocates a new array. Cost: ~5ms.
- For binary thresholding, interpolation doesn't help. Stride subsampling is sufficient and free.

**Scale kernel sizes down:**

```python
h_len = max(2, int(min_w * 0.95 / scale))  # e.g. 19 → 9
v_len = max(2, int(min_h * 0.95 / scale))
```

**Scale bboxes back up after detection:**

```python
boxes = [(int(x*scale), int(y*scale), int(w*scale), int(h*scale)) for ...]
```

### Impact on every step

| Step | Before (v5) | After (v6) |
|---|---|---|
| threshold | 22ms | **5ms** |
| morph OPEN | 59ms | **11ms** |
| label+bbox | 20ms | **8ms** |
| Total | 111ms | **37ms** |

### v6 Total: **37ms** | boxdetect: **37ms** | **🎯 Parity achieved**
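A quick check that the stride view really is free. Illustrative sketch; `np.zeros` stands in for a real grayscale image:

```python
import numpy as np

gray = np.zeros((2200, 1700), dtype=np.uint8)

small = gray[::2, ::2]     # subsample every other row/column
print(small.shape)         # (1100, 850)
print(small.base is gray)  # True — a view, no pixel data copied
print(small.strides)       # (3400, 2): double the parent's (1700, 1) strides
```

Only the shape and strides change; the 3.74M source bytes are never touched until the downstream ops read them.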

---

## Final Comparison

| Version | Total Time | vs boxdetect | Key change |
|---|---|---|---|
| v0 | 4248ms | 120x slower | Baseline — pure Python CC |
| v1 | 305ms | 8.7x slower | `scipy.ndimage.label` + `find_objects` |
| v2 | 181ms | 5.2x slower | Stride-tricks 1D morph open |
| v3 | 128ms | 3.7x slower | Cumsum + contiguous slice subtraction |
| v4 | 117ms | 3.3x slower | `np.bincount` for Otsu |
| v5 | 111ms | 3.2x slower | Reuse transpose, bool view |
| **v6** | **37ms** | **1.0x — tied** ✅ | 2x stride downsample |

---

## Step-by-step timing: final state

```
[boxdetect]   TOTAL 36ms | rects=67
[npboxdetect] TOTAL 37ms | rects=68 (1 extra due to bbox scale rounding)

npboxdetect breakdown:
1.   load + grayscale            9ms
2.   downsample 2x               0ms  ← free numpy view
3.   thresholding (otsu+mean)    5ms
4.   morph OPEN (separable)     11ms
5+6. label + bboxes (scipy)      8ms
7.   filter size/ratio           0ms
8.   NMS merge                   1ms
```

---

## Key Lessons

1. **Profile before optimizing.** The bottleneck was never where we assumed (morph open) — it was the pure-Python CC loop (1602ms) and the per-label `np.where` (2322ms).

2. **`scipy.ndimage` is C-compiled.** `label` + `find_objects` replaced ~3900ms of Python with ~24ms of C. The single biggest win.

3. **Prefix sum (cumsum) is O(N) for any window size.** Stride tricks look clever but create non-contiguous memory access. Cumsum is cache-friendly and scales with image size, not kernel size.

4. **Contiguous slices beat fancy indexing.** `cs[:, length:] - cs[:, :W]` is one vectorized subtract. `cs[:, j+length] - cs[:, j]` with an index array forces numpy to gather non-contiguous memory — ~10x slower.

5. **`np.bincount` beats `np.histogram` for integer arrays.** Same result, 3x faster.

6. **Downsample the image, not the algorithm.** When every op scales with pixel count, cutting pixels by 4x is worth more than any algorithmic improvement.

7. **`array[::2, ::2]` is free.** Stride-based subsampling is a view — zero copy, zero cost. `cv2.resize` interpolates and allocates. Don't use resize when you don't need interpolation.
@@ -0,0 +1,136 @@

# JOURNEY2 — Pushing Below boxdetect

## Starting point (end of JOURNEY1)

npboxdetect: **37ms** | boxdetect: **37ms** — tied

---

## What we found when we profiled harder

A 100-run tight benchmark revealed the real per-step floor:

| Step | Time |
|---|---|
| cv2.imread GRAY | 8ms |
| threshold | 4ms |
| morph_open | 10ms |
| scipy label+find_objects | 4.7ms |
| NMS | **20ms** ← hidden bottleneck |

NMS was 20ms because it was running on **1067 raw boxes** (all blobs) instead of the 68 filtered ones.

---

## Fix 1 — Vectorized NMS on filtered boxes: 20ms → 0.2ms

Old: while-loop NMS over 1067 boxes = 20ms
New: build the full pairwise IoU matrix on the 68 filtered boxes, suppress via an upper-triangle mask

```python
iou = inter / (areas[:, None] + areas[None, :] - inter + 1e-6)
suppress[i+1:] |= iou[i, i+1:] >= thresh
```

**Result: 32ms → 25ms**
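The snippet above can be fleshed out into a self-contained sketch of this filtered, IoU-matrix NMS. The `nms` helper and the `(x, y, w, h)` box layout are assumptions of this sketch:

```python
import numpy as np

def nms(boxes: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Greedy NMS over (x, y, w, h) boxes using a full pairwise IoU matrix."""
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    areas = boxes[:, 2] * boxes[:, 3]
    # Pairwise intersection widths/heights via broadcasting, clipped at 0.
    iw = np.clip(np.minimum(x2[:, None], x2[None, :])
                 - np.maximum(x1[:, None], x1[None, :]), 0, None)
    ih = np.clip(np.minimum(y2[:, None], y2[None, :])
                 - np.maximum(y1[:, None], y1[None, :]), 0, None)
    inter = iw * ih
    iou = inter / (areas[:, None] + areas[None, :] - inter + 1e-6)
    suppress = np.zeros(len(boxes), dtype=bool)
    for i in range(len(boxes)):  # short loop: boxes are pre-filtered to ~70
        if suppress[i]:
            continue
        suppress[i+1:] |= iou[i, i+1:] >= thresh
    return boxes[~suppress]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]])
print(nms(boxes))  # drops the near-duplicate, keeps the far-away box
```

The O(n²) matrix is only cheap because the size/ratio filter runs first; on 1067 raw boxes it would allocate a 1067×1067 matrix per pass.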

---

## Fix 2 — cv2.connectedComponentsWithStats replaces scipy: 4.7ms → 1.7ms

`scipy.ndimage.label` + `find_objects` = 4.7ms
`cv2.connectedComponentsWithStats` = 1.7ms — same C speed, but it returns the stats (x, y, w, h, area) directly, so no second pass is needed.

```python
num, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
boxes = [(stats[i, 0], stats[i, 1], stats[i, 2], stats[i, 3]) for i in range(1, num)]  # skip label 0 (background)
```

**Result: 25ms → processing floor ~17ms**

---

## Fair benchmark — imread excluded (both pay the same PNG cost)

PNG decode via `cv2.imread` costs ~28ms for both libraries. No faster PNG reader exists (tested cv2, PIL, imageio — cv2 wins). So we exclude imread and measure pure processing:

| Image | npboxdetect p50 | boxdetect p50 | Winner |
|---|---|---|---|
| lc_application1.png | **19.3ms** | 28.1ms | 🏆 npboxdetect |
| lc_application2.png | **21.2ms** | 25.0ms | 🏆 npboxdetect |

**npboxdetect is ~30% faster on pure processing.**

---

## Current floor per step

```
imread (PNG)     28ms   — libpng zlib, irreducible
threshold         4ms
morph_open       10ms   — 4 cumsums on (1100, 850)
cv2 CC + bboxes 1.7ms
NMS             0.2ms
─────────────────────
processing total ~17ms
```

morph_open (10ms) is the last wall. It's 4 cumsum passes over a 935K-element array — each cumsum costs ~2ms at uint16.

---

## Fix 3 — Numba parallel run-length morph open: 10ms → 1.3ms

Replaced the cumsum trick (4 numpy passes) with a numba `@njit(parallel=True)` run-length scan.
Each row/column is processed independently → no dependencies → all rows run in parallel via `prange`.

```python
@njit(parallel=True, cache=True)
def open_lines_numba(b, L):
    for r in prange(H):    # each row independent:
        ...                # run-length: find L-long horizontal runs of 1s, mark them
    for col in prange(W):  # each column independent:
        ...                # same for vertical runs
    return out_h | out_v   # (schematic sketch, not the full implementation)
```

**Result: 17ms → 8ms processing total**
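The per-row run-length logic, shown here in plain Python for clarity (under numba the outer loop becomes `prange(H)` and each row runs on its own thread; `open_rows_runlength` is an illustrative name, not the package's function):

```python
import numpy as np

def open_rows_runlength(b: np.ndarray, L: int) -> np.ndarray:
    """Horizontal opening via run-length scan: keep only 1-runs of length >= L."""
    H, W = b.shape
    out = np.zeros_like(b)
    for r in range(H):          # with numba: `for r in prange(H)`
        run_start = -1
        for c in range(W + 1):  # sweep one past the end to close a trailing run
            if c < W and b[r, c]:
                if run_start < 0:
                    run_start = c            # a run of 1s begins here
            elif run_start >= 0:
                if c - run_start >= L:       # run long enough: it survives
                    out[r, run_start:c] = 1
                run_start = -1               # run ends (too short: dropped)
    return out

b = np.array([[1, 1, 1, 0, 1, 1]], dtype=np.uint8)
print(open_rows_runlength(b, 3))  # [[1 1 1 0 0 0]]
```

One pass per row, no cumsum buffers, and each row writes only its own output slice, which is exactly what makes the `prange` parallelization safe.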

---

## Fix 4 — Fused numba otsu+threshold: 2.1ms → 1.4ms

`cv2.threshold(..., THRESH_OTSU)` costs 1.3ms just to return the threshold value.
Replaced it with a single fused numba function: one serial histogram pass → Otsu + mean → one parallel binarization pass.
This eliminates the cv2 call entirely and outputs 0/1 directly (skipping the `(binary > 0).astype` cast).

```python
@njit(parallel=True, cache=True)
def otsu_and_threshold(gray):
    # histogram pass (serial, 256 bins)
    # otsu + mean calculation (256-iteration loop)
    # parallel binarize: pixel <= max(t_otsu, t_mean) → 1 else 0
    return out  # dtype=uint8, values 0/1
```

**Result: 8ms → 6.2ms processing total**

---

## Final benchmark (processing only, imread excluded)

| Image | npboxdetect p50 | boxdetect p50 | Speedup |
|---|---|---|---|
| lc_application1.png | **6.5ms** | 25.5ms | **3.9x** |
| lc_application2.png | **7.1ms** | 18.6ms | **2.6x** |

## Final floor per step

```
downsample 2x    ~0ms — zero-copy stride view
otsu+threshold  1.4ms — fused numba (histogram + binarize)
morph open      1.3ms — numba parallel run-length
cv2 CC + filter 1.3ms — C-compiled, irreducible
NMS             0.2ms — vectorized pairwise IoU
─────────────────────────
processing total ~6ms
```
@@ -0,0 +1,83 @@

Metadata-Version: 2.4
Name: npboxdetect
Version: 0.1.0
Summary: Fastest checkbox/box detector — NumPy + Numba, 3-4x faster than boxdetect
Project-URL: Homepage, https://github.com/santhoshkammari/checkboxer
Author-email: santhoshkammari <santhoshkammari1999@gmail.com>
License: MIT
Keywords: checkbox,detection,document,numba,numpy,ocr
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.9
Requires-Dist: numba
Requires-Dist: numpy
Requires-Dist: opencv-python
Description-Content-Type: text/markdown

# checkboxer / npboxdetect

> Fastest checkbox detector. NumPy + Numba. No deep learning.

## Benchmark

| Approach | p50 (ms) | boxes | vs npboxdetect |
|---|---|---|---|
| **npboxdetect** | **34ms** | 69 | baseline |
| boxdetect | 60ms | 67 | 1.8x slower |
| opencv_contours | 33ms | 179 | ~same (but noisy) |
| morphology | 34ms | 25 | ~same (misses many) |

*50-run p50, full pipeline including imread, on 2200×1700 document images.*

**Pure processing (imread excluded):**

| Image | npboxdetect | boxdetect | Speedup |
|---|---|---|---|
| lc_application1.png | **6.5ms** | 25.5ms | **3.9x** |
| lc_application2.png | **7.1ms** | 18.6ms | **2.6x** |

## Usage

```python
from npboxdetect.detector import get_boxes

boxes = get_boxes("form.png")
# [(x, y, w, h), ...]
```

## How it works

Five steps, all near the hardware limit:

```
downsample 2x         ~0ms  stride view, zero copy
otsu + threshold     1.4ms  fused numba: histogram → binarize in one shot
morph open           1.3ms  numba parallel run-length scan (rows+cols)
connected components 1.3ms  cv2.connectedComponentsWithStats
NMS                  0.2ms  vectorized pairwise IoU matrix
─────────────────────────────
processing total      ~6ms
```

Key ideas:
- **2x downsample** before processing — a free stride view that cuts pixel count 4x
- **Fused numba otsu+threshold** — one serial histogram pass + one parallel binarize pass, no cv2 call
- **Run-length morph open** — each row/column is independent → `prange` parallelism, no cumsum overhead
- **Filter before NMS** — the size/ratio filter drops 1000+ blobs to ~70 before NMS runs

## Install

```bash
git clone https://github.com/santhoshkammari/checkboxer
cd checkboxer
pip install numba opencv-python numpy
```

## Journey

See [JOURNEY.md](JOURNEY.md) (v0→v6, 4248ms→37ms) and [JOURNEY2.md](JOURNEY2.md) (v7→v10, 37ms→6ms).

## Author

[santhoshkammari](https://github.com/santhoshkammari)
@@ -0,0 +1,66 @@

# checkboxer / npboxdetect

> Fastest checkbox detector. NumPy + Numba. No deep learning.

## Benchmark

| Approach | p50 (ms) | boxes | vs npboxdetect |
|---|---|---|---|
| **npboxdetect** | **34ms** | 69 | baseline |
| boxdetect | 60ms | 67 | 1.8x slower |
| opencv_contours | 33ms | 179 | ~same (but noisy) |
| morphology | 34ms | 25 | ~same (misses many) |

*50-run p50, full pipeline including imread, on 2200×1700 document images.*

**Pure processing (imread excluded):**

| Image | npboxdetect | boxdetect | Speedup |
|---|---|---|---|
| lc_application1.png | **6.5ms** | 25.5ms | **3.9x** |
| lc_application2.png | **7.1ms** | 18.6ms | **2.6x** |

## Usage

```python
from npboxdetect.detector import get_boxes

boxes = get_boxes("form.png")
# [(x, y, w, h), ...]
```

## How it works

Five steps, all near the hardware limit:

```
downsample 2x         ~0ms  stride view, zero copy
otsu + threshold     1.4ms  fused numba: histogram → binarize in one shot
morph open           1.3ms  numba parallel run-length scan (rows+cols)
connected components 1.3ms  cv2.connectedComponentsWithStats
NMS                  0.2ms  vectorized pairwise IoU matrix
─────────────────────────────
processing total      ~6ms
```

Key ideas:
- **2x downsample** before processing — a free stride view that cuts pixel count 4x
- **Fused numba otsu+threshold** — one serial histogram pass + one parallel binarize pass, no cv2 call
- **Run-length morph open** — each row/column is independent → `prange` parallelism, no cumsum overhead
- **Filter before NMS** — the size/ratio filter drops 1000+ blobs to ~70 before NMS runs

## Install

```bash
git clone https://github.com/santhoshkammari/checkboxer
cd checkboxer
pip install numba opencv-python numpy
```

## Journey

See [JOURNEY.md](JOURNEY.md) (v0→v6, 4248ms→37ms) and [JOURNEY2.md](JOURNEY2.md) (v7→v10, 37ms→6ms).

## Author

[santhoshkammari](https://github.com/santhoshkammari)