openvisionkit 0.4.0__py3-none-any.whl

@@ -0,0 +1,1018 @@
1
+ Metadata-Version: 2.4
2
+ Name: openvisionkit
3
+ Version: 0.4.0
4
+ Summary: MediaPipe Tasks API wrapper for Python computer vision
5
+ License: MIT
6
+ Keywords: computer-vision,face-detection,mediapipe,opencv,pose-estimation
7
+ Classifier: Development Status :: 3 - Alpha
8
+ Classifier: Intended Audience :: Developers
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Classifier: Topic :: Scientific/Engineering :: Image Recognition
14
+ Requires-Python: >=3.11.8
15
+ Requires-Dist: imageio>=2.37.3
16
+ Requires-Dist: imutils>=0.5.4
17
+ Requires-Dist: mediapipe>=0.10.35
18
+ Requires-Dist: mss>=10.2.0
19
+ Requires-Dist: opencv-python>=4.13.0.92
20
+ Requires-Dist: pandas>=3.0.3
21
+ Requires-Dist: pyautogui>=0.9.54
22
+ Requires-Dist: pytesseract>=0.3.13
23
+ Requires-Dist: scikit-image>=0.26.0
24
+ Description-Content-Type: text/markdown
25
+
26
+ # OpenVisionKit
27
+
28
+ [![CI - Unit Tests](https://github.com/anurupborah2001/openvisionkit/actions/workflows/ci-unit.yml/badge.svg)](https://github.com/anurupborah2001/openvisionkit/actions/workflows/ci-unit.yml)
29
+ [![Security Scan](https://github.com/anurupborah2001/openvisionkit/actions/workflows/ci-security.yml/badge.svg)](https://github.com/anurupborah2001/openvisionkit/actions/workflows/ci-security.yml)
30
+ [![PyPI version](https://badge.fury.io/py/openvisionkit.svg)](https://pypi.org/p/openvisionkit)
31
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
32
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
33
+
34
+ **OpenVisionKit** is a high-level Python computer vision library built on top of [MediaPipe](https://developers.google.com/mediapipe) and [OpenCV](https://opencv.org/). It provides production-ready detectors and segmentation utilities for face detection, face mesh, hand tracking, pose estimation, object detection, and background segmentation — wrapped in clean, developer-friendly APIs that eliminate boilerplate and let you focus on building.
35
+
36
+ Whether you are prototyping a gesture-controlled application, building a fitness tracker, adding AR effects, or conducting research, OpenVisionKit gives you the tools to go from camera frame to structured detections in a few lines of code.
37
+
38
+ ---
39
+
40
+ ## Features
41
+
42
+ | Module | Capability |
43
+ |---|---|
44
+ | `FaceDetector` | Bounding boxes, 6-point keypoints, confidence filtering, IoU, face cropping |
45
+ | `FaceMeshDetector` | 478 landmarks, blendshapes, head pose (yaw/pitch/roll), gaze direction, emotion, AR overlays |
46
+ | `HandDetector` | 21 landmarks, gesture recognition, finger-join detection, distance estimation, palm width |
47
+ | `PoseDetector` | 33 body landmarks, joint angle calculation, exercise detection, workout rep counter, segmentation |
48
+ | `ObjectDetector` | EfficientDet-based multi-class detection with bounding boxes and labels |
49
+ | `SelfieSegmentation` | Background removal, blur, replacement, virtual backgrounds, alpha blending |
50
+ | `HairSegmentation` | Hair region segmentation and recoloring |
51
+ | `ScreenCapture` | High-performance screen grabbing via `mss` |
52
+ | `video_capture_template` | Drop-in webcam loop with FPS overlay, recording, and screenshot support |
53
+ | `image_template` | Single-image processing template with auto-centering, resize, and custom logic hook |
54
+ | `TextDetector` | Tesseract OCR with character/word/digit/table detection, NLP entity extraction, image matching, handwriting support |
55
+
56
+ ---
57
+
58
+ ## Requirements
59
+
60
+ - Python >= 3.11.8
61
+ - A `.tflite` / `.task` model file for each MediaPipe detector (see [Model Downloads](#model-downloads))
62
+
63
+ ### TextDetector additional requirements
64
+
65
+ `TextDetector` uses Tesseract OCR and optional NLP tooling that are **not** bundled with MediaPipe.
66
+
67
+ **1. Install Tesseract binary** (system-level):
68
+
69
+ ```bash
70
+ # macOS
71
+ brew install tesseract
72
+
73
+ # Ubuntu / Debian
74
+ sudo apt-get install tesseract-ocr
75
+
76
+ # Windows
77
+ # Download installer from https://github.com/UB-Mannheim/tesseract/wiki
78
+ ```
79
+
80
+ **2. Install Python packages:**
81
+
82
+ ```bash
83
+ # pip
84
+ pip install pytesseract imutils pandas scikit-image Pillow
85
+
86
+ # uv
87
+ uv add pytesseract imutils pandas scikit-image Pillow
88
+ ```
89
+
90
+ **3. (Optional) spaCy for NLP features** (entity extraction, keyword extraction, summarization, relation extraction):
91
+
92
+ ```bash
93
+ pip install spacy
94
+ python -m spacy download en_core_web_sm
95
+ ```
96
+
97
+ Without spaCy, all NLP methods return empty results gracefully — the rest of `TextDetector` works without it.
98
+
99
+ ---
100
+
101
+ ## Installation
102
+
103
+ ### pip
104
+
105
+ ```bash
106
+ pip install openvisionkit
107
+ ```
108
+
109
+ Or install directly from source:
110
+
111
+ ```bash
112
+ pip install git+https://github.com/your-org/openvisionkit.git
113
+ ```
114
+
115
+ ### uv
116
+
117
+ ```bash
118
+ uv add openvisionkit
119
+ ```
120
+
121
+ Or from source:
122
+
123
+ ```bash
124
+ uv add git+https://github.com/your-org/openvisionkit.git
125
+ ```
126
+
127
+ For development (editable install with all dev dependencies):
128
+
129
+ ```bash
130
+ git clone https://github.com/your-org/openvisionkit.git
131
+ cd openvisionkit
132
+ make setup
133
+ ```
134
+
135
+ ---
136
+
137
+ ## Model Downloads
138
+
139
+ OpenVisionKit delegates inference to MediaPipe `.tflite` / `.task` model files. Download the models you need and place them in a `models/` directory at your project root.
140
+
141
+ | Detector | Model file | Download |
142
+ |---|---|---|
143
+ | `FaceDetector` | `face_detector.tflite` | [MediaPipe Face Detector](https://developers.google.com/mediapipe/solutions/vision/face_detector) |
144
+ | `FaceMeshDetector` | `face_landmarker_v2_with_blendshapes.task` | [MediaPipe Face Landmarker](https://developers.google.com/mediapipe/solutions/vision/face_landmarker) |
145
+ | `HandDetector` | `hand_landmarker.task` | [MediaPipe Hand Landmarker](https://developers.google.com/mediapipe/solutions/vision/hand_landmarker) |
146
+ | `PoseDetector` | `pose_landmarker.task` | [MediaPipe Pose Landmarker](https://developers.google.com/mediapipe/solutions/vision/pose_landmarker) |
147
+ | `ObjectDetector` | `efficientdet_lite.tflite` | [MediaPipe Object Detector](https://developers.google.com/mediapipe/solutions/vision/object_detector) |
148
+ | `SelfieSegmentation` | `deeplab_v3.tflite` | [MediaPipe Image Segmenter](https://developers.google.com/mediapipe/solutions/vision/image_segmenter) |
149
+ | `HairSegmentation` | `hair_segmenter.tflite` | [MediaPipe Hair Segmenter](https://developers.google.com/mediapipe/solutions/vision/image_segmenter) |
150
+
151
+ ---
152
+
153
+ ## Quick Start
154
+
155
+ ```python
156
+ import cv2
157
+ from openvisionkit.capture.video_template import video_capture_template
158
+ from openvisionkit.lib.hand_detector import HandDetector
159
+
160
+ detector = HandDetector(model_path="./models/hand_landmarker.task")
161
+
162
+ def process(frame):
163
+     frame = detector.draw_landmarks(frame)
163
+     return frame
165
+
166
+ video_capture_template(custom_logic=process, window_name="Hand Tracking")
167
+ ```
168
+
169
+ ---
170
+
171
+ ## Usage
172
+
173
+ ### FaceDetector
174
+
175
+ Detects faces in an image or video stream and returns bounding boxes, keypoints, and confidence scores.
176
+
177
+ ```python
178
+ import cv2
179
+ from openvisionkit.lib.face_detector import FaceDetector
180
+
181
+ detector = FaceDetector(
182
+ model_path="./models/face_detector.tflite",
183
+ max_faces=5,
184
+ running_mode="IMAGE", # "IMAGE" | "VIDEO"
185
+ min_detection_confidence=0.5,
186
+ min_suppression_threshold=0.3,
187
+ )
188
+
189
+ frame = cv2.imread("photo.jpg")
190
+
191
+ # Returns annotated frame + list of detection dicts
192
+ annotated, detections = detector.detect_faces(frame, to_draw_bounding_box=True, to_draw_landmarks=True)
193
+
194
+ for det in detections:
195
+     print(det["id"])  # face index
195
+     print(det["score"])  # confidence 0–1
196
+     print(det["bbox"])  # (x, y, w, h)
197
+     print(det["bbox_xyxy"])  # (x1, y1, x2, y2)
198
+     print(det["center"])  # (cx, cy)
199
+     print(det["normalized_keypoints"])  # list of (x, y) pixel coords for 6 landmarks
201
+
202
+ cv2.imshow("Faces", annotated)
203
+ cv2.waitKey(0)
204
+ ```
205
+
206
+ **Utility methods:**
207
+
208
+ ```python
209
+ # Filter detections below a confidence threshold
210
+ confident = detector.filter_by_confidence(detections, threshold=0.7)
211
+
212
+ # Get the largest face by bounding-box area
213
+ biggest = detector.get_largest_face(detections)
214
+
215
+ # Crop face regions out of the image (optional pixel margin)
216
+ face_crops = detector.crop_faces(frame, detections, margin=10)
217
+
218
+ # Sort by area (descending) or any other detection key
219
+ sorted_faces = detector.sort_faces(detections, by="area")
220
+
221
+ # Intersection over Union — useful for NMS or tracking
222
+ iou = detector.get_iou(detections[0]["bbox_xyxy"], detections[1]["bbox_xyxy"])
223
+ ```
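For reference, Intersection over Union on two `(x1, y1, x2, y2)` boxes can be sketched in plain Python as below. This is an illustrative sketch, not necessarily how `get_iou` is implemented; the helper name `iou_xyxy` is hypothetical.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes.

    Hypothetical helper for illustration; not part of openvisionkit.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width and height clamp to 0 when the boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A value near 1 means the two boxes almost coincide, which is why IoU is a common criterion for suppressing duplicate detections or matching a face across frames.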
224
+
225
+ ---
226
+
227
+ ### FaceMeshDetector
228
+
229
+ Detects 478 facial landmarks per face along with blendshape expressions and head-pose matrices.
230
+
231
+ ```python
232
+ import cv2
233
+ from openvisionkit.lib.face_mesh_detector import FaceMeshDetector
234
+
235
+ detector = FaceMeshDetector(
236
+ model_path="./models/face_landmarker_v2_with_blendshapes.task",
237
+ num_faces=2,
238
+ min_face_detection_confidence=0.5,
239
+ output_face_blendshapes=True,
240
+ output_facial_transformation_matrixes=True,
241
+ )
242
+
243
+ frame = cv2.imread("face.jpg")
244
+
245
+ annotated, faces, blendshapes, matrices, bboxes = detector.face_mesh_detection(frame, drawLandMarks=True)
246
+
247
+ # faces[i] -> list of [x, y] pixel coords for 478 landmarks
248
+ # blendshapes[i] -> dict of {blendshape_name: score} (52 expressions)
249
+ # matrices[i] -> 4x4 numpy head-pose matrix
250
+ # bboxes[i] -> [min_x, min_y, max_x, max_y]
251
+
252
+ for i, blend in enumerate(blendshapes):
253
+     # Rule-based emotion from blendshapes
253
+     emotion = detector.get_emotion(blend)
254
+     print(f"Face {i}: {emotion}")
255
+
256
+     # Gaze direction for each eye
257
+     gaze = detector.get_eye_gaze_direction(faces[i], is_left_eye=True)
258
+     print(f"Left gaze: {gaze}")  # "Left" | "Center" | "Right"
259
+
260
+     # Mouth openness ratio (0 = closed, 0.5+ = wide open)
261
+     ratio = detector.get_mouth_openness_ratio(faces[i])
262
+     print(f"Mouth ratio: {ratio:.2f}")
263
+
264
+     # Head pose angles from transformation matrix
265
+     if matrices[i] is not None:
266
+         yaw, pitch, roll = detector.get_head_pose_angles(matrices[i])
267
+         print(f"Yaw: {yaw:.1f} Pitch: {pitch:.1f} Roll: {roll:.1f}")
268
+
269
+     # Inter-pupillary distance
270
+     ipd = detector.get_inter_pupillary_distance(faces[i], normalized=False)
271
+     print(f"IPD: {ipd:.1f}px")
273
+ ```
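A 4x4 head-pose matrix carries a 3x3 rotation in its upper-left block, and yaw/pitch/roll can be recovered from that block. The sketch below assumes a Z-Y-X decomposition (roll about Z, yaw about Y, pitch about X) and returns degrees. `get_head_pose_angles` may use a different axis order or sign convention, and `euler_from_matrix` is a hypothetical name.

```python
import math


def euler_from_matrix(m):
    """Return (yaw, pitch, roll) in degrees from a 4x4 (or 3x3) matrix.

    Assumption: R = Rz(roll) @ Ry(yaw) @ Rx(pitch). Works with nested lists
    or NumPy arrays. Gimbal lock (yaw near +/-90 degrees) is not handled.
    """
    r00, r10, r20 = m[0][0], m[1][0], m[2][0]
    r21, r22 = m[2][1], m[2][2]

    cos_yaw = math.hypot(r00, r10)      # equals |cos(yaw)|
    pitch = math.atan2(r21, r22)        # rotation about X
    yaw = math.atan2(-r20, cos_yaw)     # rotation about Y
    roll = math.atan2(r10, r00)         # rotation about Z
    return tuple(math.degrees(a) for a in (yaw, pitch, roll))


# Example input: a pure 30-degree rotation about the Y axis.
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
turned_head = [
    [c, 0.0, s, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-s, 0.0, c, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
yaw, pitch, roll = euler_from_matrix(turned_head)
```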
274
+
275
+ **AR overlay example:**
276
+
277
+ ```python
278
+ # Overlay a PNG glasses filter (must have alpha channel)
279
+ glasses = cv2.imread("glasses.png", cv2.IMREAD_UNCHANGED) # RGBA
280
+ frame_with_glasses = detector.overlay_ar_filter(frame, faces[0], glasses, filter_type="glasses")
281
+ ```
282
+
283
+ ---
284
+
285
+ ### HandDetector
286
+
287
+ Tracks up to N hands with 21 landmarks each. Provides gesture recognition, finger-join detection, and distance estimation.
288
+
289
+ ```python
290
+ import cv2
291
+ from openvisionkit.lib.hand_detector import HandDetector
292
+
293
+ detector = HandDetector(
294
+ model_path="./models/hand_landmarker.task",
295
+ running_mode="IMAGE", # "IMAGE" | "VIDEO"
296
+ max_hands=2,
297
+ detection_confidence=0.5,
298
+ tracking_confidence=0.5,
299
+ smoothing_window=8,
300
+ )
301
+
302
+ frame = cv2.imread("hand.jpg")
303
+
304
+ # Draw landmarks, bounding box, and handedness label
305
+ annotated = detector.draw_landmarks(
306
+ frame,
307
+ to_draw_landmark=True,
308
+ to_draw_bounding_box=True,
309
+ to_put_handle_label=True,
310
+ )
311
+
312
+ # Get structured landmark data for all detected hands
313
+ all_hands = detector.get_landmarks(frame)
314
+
315
+ for hand in all_hands:
316
+     print(hand["hand_type"])  # "Left" or "Right"
316
+     print(hand["bounding_box"])  # (x, y, w, h)
317
+     print(hand["center_point"])  # (cx, cy)
318
+     lm = hand["landmarks_list"]  # list of [id, x, y, z]
319
+
320
+     # Which fingers are raised?
321
+     fingers = detector.fingers_up(lm)
322
+     # [thumb, index, middle, ring, little]: 1=up, 0=down
323
+
324
+     # Gesture shortcuts
325
+     print(detector.is_fist())
326
+     print(detector.is_thumbs_up())
327
+     print(detector.is_peace_sign())
328
+     print(detector.is_open_hand())
329
+
330
+     # Distance between any two landmarks with visual feedback
331
+     p1 = (lm[4][1], lm[4][2])  # thumb tip
332
+     p2 = (lm[8][1], lm[8][2])  # index tip
333
+     length, annotated, coords = detector.get_distance(p1, p2, annotated)
334
+     print(f"Thumb-index distance: {length:.1f}px")
335
+
336
+     # Detect if two finger tips are touching
337
+     joined = detector.is_fingers_joined(4, 8, annotated, lm, threshold=0.25)
338
+
339
+     # Palm width in pixels (stable reference)
340
+     palm_px, idx_mcp, pinky_mcp = detector.palm_width_px(frame, lm)
341
+     print(f"Palm width: {palm_px:.1f}px")
343
+ ```
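One plausible way to make a "fingers touching" check independent of how close the hand is to the camera is to divide the fingertip gap by the palm width. The sketch below assumes that interpretation of `threshold` and the `[id, x, y, z]` landmark layout shown above; it is not confirmed to be how `is_fingers_joined` works, and `is_pinched` is a hypothetical helper.

```python
import math

# MediaPipe hand landmark ids.
THUMB_TIP, INDEX_MCP, INDEX_TIP, PINKY_MCP = 4, 5, 8, 17


def point(landmarks, idx):
    """Return (x, y) for landmark `idx` from a list of [id, x, y, z] rows.

    Assumes the rows are ordered so that landmarks[idx][0] == idx.
    """
    _, x, y, _ = landmarks[idx]
    return x, y


def is_pinched(landmarks, threshold=0.25):
    """True when the thumb-index gap is small relative to the palm width."""
    palm = math.dist(point(landmarks, INDEX_MCP), point(landmarks, PINKY_MCP))
    if palm == 0:
        return False  # degenerate detection; avoid division by zero
    gap = math.dist(point(landmarks, THUMB_TIP), point(landmarks, INDEX_TIP))
    return gap / palm < threshold
```

Because both distances scale together as the hand moves toward or away from the camera, a single ratio threshold behaves consistently across depths, which a fixed pixel threshold would not.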
344
+
345
+ **Distance estimation (calibration-based):**
346
+
347
+ ```python
348
+ # Provide (palm_width_px, distance_cm) pairs to calibrate
349
+ calibration = [(180, 20), (120, 35), (80, 55), (60, 75)]
350
+ detector_calibrated = HandDetector(
351
+ model_path="./models/hand_landmarker.task",
352
+ calibration_samples=calibration,
353
+ )
354
+
355
+ # After calibration, estimate distance from a new palm width
356
+ distance_cm = detector_calibrated.estimate_distance_cm(palm_width_px=110)
357
+ print(f"Estimated distance: {distance_cm:.1f} cm")
358
+ ```
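Under a pinhole-camera approximation, apparent palm width is roughly inversely proportional to distance, so a calibration can be fitted as `distance_cm = a / palm_px + b`. The sketch below fits that model by ordinary least squares to the sample pairs shown above. The model choice and the function names are assumptions for illustration; the library's `estimate_distance_cm` may fit a different curve.

```python
def fit_inverse_model(samples):
    """Least-squares fit of distance_cm = a * (1 / palm_px) + b.

    `samples` is a list of (palm_width_px, distance_cm) pairs.
    Hypothetical helper; not part of openvisionkit.
    """
    xs = [1.0 / px for px, _ in samples]
    ys = [float(cm) for _, cm in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def estimate_from_fit(palm_px, a, b):
    """Evaluate the fitted model for a new palm width in pixels."""
    return a / palm_px + b


calibration = [(180, 20), (120, 35), (80, 55), (60, 75)]
a, b = fit_inverse_model(calibration)
print(f"Estimated distance: {estimate_from_fit(110, a, b):.1f} cm")
```

More calibration pairs spread across the working range generally give a steadier fit than a few clustered ones.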
359
+
360
+ ---
361
+
362
+ ### PoseDetector
363
+
364
+ Detects 33 body landmarks. Supports joint angle calculation, exercise classification, workout rep counting, and body segmentation.
365
+
366
+ ```python
367
+ import cv2
368
+ from openvisionkit.lib.pose_detector import PoseDetector
369
+ from mediapipe.tasks.python import vision
370
+
371
+ detector = PoseDetector(
372
+ model_path="./models/pose_landmarker.task",
373
+ running_mode=vision.RunningMode.VIDEO, # VIDEO for webcam streams
374
+ num_poses=1,
375
+ min_pose_detection_confidence=0.5,
376
+ output_segmentation_masks=True,
377
+ )
378
+
379
+ cap = cv2.VideoCapture(0)
380
+ while True:
381
+     ret, frame = cap.read()
381
+     if not ret:
382
+         break
383
+
384
+     # Detect and annotate
385
+     annotated, result = detector.detect(frame, draw_landmarks=True)
386
+
387
+     # All landmark positions as pixel dicts
388
+     landmarks = detector.get_all_postion(frame, result)
389
+
390
+     # Get a specific landmark (e.g. nose = id 0)
391
+     nose = detector.get_landmark(result, pose_index=0, landmark_id=0)
392
+     print(nose["x"], nose["y"], nose["visibility"])
393
+
394
+     # Calculate joint angle, e.g. left elbow (shoulder=11, elbow=13, wrist=15)
395
+     annotated, angle = detector.calculate_angle(annotated, result, p1=11, p2=13, p3=15)
396
+     print(f"Left elbow angle: {angle:.1f} degrees")
397
+
398
+     # Classify current exercise
399
+     exercise = detector.detect_exercise(annotated, result)
400
+     print(f"Exercise: {exercise}")
401
+
402
+     # Workout rep counter (tracks bicep curls automatically)
403
+     angle, percent, reps = detector.calculate_workout_percentage()
404
+     stats = detector.get_workout_stats(annotated)
405
+     print(f"Reps: {stats['reps']} Calories: {stats['calories']:.1f}")
406
+
407
+     # Body segmentation overlay (requires output_segmentation_masks=True)
408
+     annotated = detector.draw_segmentation_mask(annotated, result, alpha=0.5, color=(0, 255, 0))
409
+
410
+     cv2.imshow("Pose", annotated)
411
+     if cv2.waitKey(1) & 0xFF == 27:
412
+         break
414
+
415
+ cap.release()
416
+ cv2.destroyAllWindows()
417
+ ```
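The joint angle at a middle landmark (for example the elbow, between shoulder and wrist) can be computed from three 2D points with `atan2`. The sketch below is a generic formulation under that assumption and is not taken from `calculate_angle`; `joint_angle` is a hypothetical name.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees (0 to 180) at vertex b, formed by points a-b-c.

    Each point is an (x, y) pair in pixel coordinates.
    """
    # Direction of each limb segment as seen from the joint.
    to_c = math.atan2(c[1] - b[1], c[0] - b[0])
    to_a = math.atan2(a[1] - b[1], a[0] - b[0])

    angle = abs(math.degrees(to_c - to_a)) % 360
    # Fold reflex angles so a fully extended limb reads as 180, not more.
    return 360 - angle if angle > 180 else angle


# Shoulder, elbow, and wrist forming a right angle.
print(joint_angle((1, 0), (0, 0), (0, 1)))
```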
418
+
419
+ **Auto-select the most visible arm for curl tracking:**
420
+
421
+ ```python
422
+ p1, p2, p3 = detector.select_active_arm(result)
423
+ annotated, angle = detector.calculate_angle(annotated, result, p1, p2, p3)
424
+ ```
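Curl tracking of this kind is often built from two pieces: mapping the elbow angle to a 0-100 % completion value, and a small state machine that counts one rep per full extend-flex-extend cycle. The sketch below illustrates that idea; the 160 and 40 degree limits and the names `curl_percent` and `RepCounter` are assumptions, not the library's values or API.

```python
def curl_percent(angle, extended=160.0, flexed=40.0):
    """Map an elbow angle to 0-100 %, clamped: extended arm = 0, flexed = 100."""
    pct = (extended - angle) / (extended - flexed) * 100.0
    return max(0.0, min(100.0, pct))


class RepCounter:
    """Counts one rep each time the arm goes down -> up -> down."""

    def __init__(self):
        self.reps = 0
        self.stage = "down"

    def update(self, angle):
        pct = curl_percent(angle)
        if pct >= 100.0 and self.stage == "down":
            self.stage = "up"          # top of the curl reached
        elif pct <= 0.0 and self.stage == "up":
            self.stage = "down"        # arm extended again: rep complete
            self.reps += 1
        return self.reps


counter = RepCounter()
for elbow_angle in [170, 100, 35, 100, 170, 30, 165]:
    counter.update(elbow_angle)
print(counter.reps)
```

Requiring both extremes before counting keeps small oscillations around the midpoint from registering as reps.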
425
+
426
+ ---
427
+
428
+ ### ObjectDetector
429
+
430
+ Detects multiple object classes in a frame using EfficientDet Lite.
431
+
432
+ ```python
433
+ import cv2
434
+ from openvisionkit.lib.object_detector import ObjectDetector
435
+
436
+ detector = ObjectDetector(
437
+ model_path="./models/efficientdet_lite.tflite",
438
+ max_results=5,
439
+ running_mode="IMAGE", # "IMAGE" | "VIDEO"
440
+ category_allowlist=None, # e.g. ["person", "car"] to restrict classes
441
+ category_denylist=None,
442
+ )
443
+
444
+ frame = cv2.imread("street.jpg")
445
+
446
+ # Returns annotated image with bounding boxes and labels drawn
447
+ annotated = detector.detect_objects(frame)
448
+
449
+ cv2.imshow("Objects", annotated)
450
+ cv2.waitKey(0)
451
+
452
+ # Or get raw detection result for custom processing
453
+ result, mp_image = detector.detect(frame)
454
+ for detection in result.detections:
455
+     label = detection.categories[0].category_name
455
+     score = detection.categories[0].score
456
+     bbox = detection.bounding_box
457
+     print(f"{label}: {score:.2f} @ ({bbox.origin_x}, {bbox.origin_y})")
459
+ ```
460
+
461
+ ---
462
+
463
+ ### SelfieSegmentation
464
+
465
+ Separates people from backgrounds using DeepLab V3. Several compositing modes are available.
466
+
467
+ ```python
468
+ import cv2
469
+ from openvisionkit.lib.selfie_segmentation import SelfieSegmentation
470
+
471
+ seg = SelfieSegmentation(
472
+ model_path="./models/deeplab_v3.tflite",
473
+ output_category_mask=True,
474
+ )
475
+
476
+ frame = cv2.imread("selfie.jpg")
477
+
478
+ # Remove background (black fill)
479
+ no_bg = seg.remove_background(frame)
480
+
481
+ # Blur background
482
+ blurred = seg.blur_background(frame, blur_strength=(55, 55))
483
+
484
+ # Replace background with an image
485
+ replaced = seg.replace_background(frame, background_path="./bg.jpg")
486
+
487
+ # Solid color background
488
+ colored = seg.color_background(frame, color=(0, 120, 255))
489
+
490
+ # Alpha-blend foreground over a custom background array
491
+ bg = cv2.imread("./bg.jpg")
492
+ blended = seg.alpha_blend(frame, bg)
493
+
494
+ # Optimized virtual background with temporal smoothing + edge refinement
495
+ # (best for real-time webcam use)
496
+ output = seg.optimize_virtual_background(frame, bg)
497
+
498
+ # Single-person isolation — removes other people in the background
499
+ output = seg.optimize_virtual_background_improved(frame, bg)
500
+
501
+ # Debug: visualize the raw segmentation heatmap
502
+ heatmap = seg.overlay_mask(frame)
503
+
504
+ cv2.imshow("Segmented", output)
505
+ cv2.waitKey(0)
506
+ ```
507
+
508
+ ---
509
+
510
+ ### HairSegmentation
511
+
512
+ Segments hair regions for recoloring or styling effects.
513
+
514
+ ```python
515
+ import cv2
516
+ import numpy as np
517
+ from openvisionkit.lib.hair_segmentation import HairSegmentation
518
+
519
+ seg = HairSegmentation(model_path="./models/hair_segmenter.tflite")
520
+
521
+ frame = cv2.imread("portrait.jpg")
522
+ rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
523
+
524
+ result = seg.process(rgb_frame)
525
+ mask = result.category_mask.numpy_view() # shape (H, W), values 0–1
526
+
527
+ # Recolor hair to blue
528
+ hair_color = np.zeros_like(frame)
529
+ hair_color[:] = (255, 0, 0) # BGR blue
530
+ hair_region = (mask > 0.5)[..., None]
531
+ output = np.where(hair_region, hair_color, frame)
532
+
533
+ cv2.imshow("Hair", output)
534
+ cv2.waitKey(0)
535
+ ```
536
+
537
+ ---
538
+
539
+ ### ScreenCapture
540
+
541
+ Captures live frames from a monitor — useful for screen-based CV pipelines.
542
+
543
+ ```python
544
+ from openvisionkit.capture.screen_capture import ScreenCapture
545
+ import cv2
546
+
547
+ cap = ScreenCapture(monitor_index=1) # 1 = primary monitor
548
+
549
+ while True:
550
+     frame = cap.grab()  # returns BGR numpy array
550
+     cv2.imshow("Screen", frame)
551
+     if cv2.waitKey(1) & 0xFF == 27:
552
+         break
554
+
555
+ cv2.destroyAllWindows()
556
+ ```
557
+
558
+ ---
559
+
560
+ ### video_capture_template
561
+
562
+ A reusable webcam loop that handles window setup, FPS display, recording, and screenshots. Pass a `custom_logic` callback for your processing.
563
+
564
+ ```python
565
+ import cv2
566
+ from openvisionkit.capture.video_template import video_capture_template
567
+ from openvisionkit.lib.face_detector import FaceDetector
568
+
569
+ detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="VIDEO")
570
+
571
+ def process(frame):
572
+     annotated, _ = detector.detect_faces(frame)
572
+     return annotated
574
+
575
+ video_capture_template(
576
+ video_source=0, # webcam index or path to video file
577
+ custom_logic=process,
578
+ window_name="Face Detection",
579
+ resolution=(1280, 720),
580
+ draw_fps=True,
581
+ enable_auto_recording=True, # auto-saves .mp4 from first frame
582
+ record_format="mp4", # "mp4" or "gif"
583
+ enable_screenshot=True, # press 's' to capture a frame
584
+ auto_screenshot_after_seconds=10.0, # also auto-capture after 10 s
585
+ auto_screenshot_repeat=False, # True = repeat every 10 s
586
+ )
587
+ ```
588
+
589
+ **Key bindings (built-in):**
590
+
591
+ | Key | Action | Condition |
592
+ |---|---|---|
593
+ | `ESC` | Exit loop | always |
594
+ | `s` / `S` | Save screenshot | `enable_screenshot=True` |
595
+ | `r` / `R` | Toggle manual recording on/off | `enable_manual_recording=True` |
596
+
597
+ **Stateful key handlers with `KeyEventManager`:**
598
+
599
+ ```python
600
+ from openvisionkit.capture.video_template import KeyEventManager, video_capture_template
601
+
602
+ state = {"score": 0}
603
+ km = KeyEventManager()
604
+ km.register(ord("p"), lambda frame, s: print(f"Score: {s['score']}"))
605
+ km.register(ord("+"), lambda frame, s: s.update({"score": s["score"] + 1}))
606
+
607
+ video_capture_template(
608
+ video_source=0,
609
+ state=state,
610
+ key_manager=km,
611
+ custom_logic=lambda frame: frame,
612
+ )
613
+ ```
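To make the handler contract concrete, the sketch below shows a minimal dispatcher that maps key codes to `(frame, state)` callbacks, matching the handler signature used above. It is a hypothetical stand-in for illustration only (`SimpleKeyDispatcher` and `dispatch` are invented names), not the implementation of `KeyEventManager`.

```python
class SimpleKeyDispatcher:
    """Hypothetical stand-in for a key-event dispatcher; not the library's class."""

    def __init__(self):
        self._handlers = {}

    def register(self, key_code, handler):
        """Associate an integer key code with a handler(frame, state)."""
        self._handlers[key_code] = handler

    def dispatch(self, key_code, frame, state):
        """Call the handler for key_code, if any. Returns True when handled."""
        handler = self._handlers.get(key_code)
        if handler is None:
            return False
        handler(frame, state)
        return True


state = {"score": 0}
dispatcher = SimpleKeyDispatcher()
dispatcher.register(ord("+"), lambda frame, s: s.update({"score": s["score"] + 1}))

# In a capture loop the key code would typically come from cv2.waitKey(1) & 0xFF.
handled = dispatcher.dispatch(ord("+"), None, state)
ignored = dispatcher.dispatch(ord("x"), None, state)
```

Passing the shared `state` dict into every handler lets plain lambdas mutate application state without globals.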
614
+
615
+ **Manual recording:**
616
+
617
+ ```python
618
+ video_capture_template(
619
+ video_source=0,
620
+ enable_manual_recording=True, # press R to start, R again to stop and save
621
+ record_format="gif",
622
+ )
623
+ ```
624
+
625
+ **Parameter reference:**
626
+
627
+ | Parameter | Type | Default | Description |
628
+ |---|---|---|---|
629
+ | `video_source` | `int \| str` | `0` | Camera index or path to video file |
630
+ | `loop_forever` | `bool` | `True` | Loop video file when it ends |
631
+ | `custom_logic` | `Callable[[ndarray], ndarray]` | `None` | Per-frame processing; receives and returns BGR image |
632
+ | `state` | `dict` | `None` | Shared state dict passed to every key handler |
633
+ | `key_manager` | `KeyEventManager` | `None` | Custom key-event dispatcher |
634
+ | `window_name` | `str` | `"Demo"` | OpenCV window title |
635
+ | `show_window` | `bool` | `True` | Display the OpenCV window |
636
+ | `resolution` | `tuple[int, int]` | `(1280, 720)` | Camera resolution `(width, height)` |
637
+ | `center_window` | `bool` | `True` | Auto-center window on screen via pyautogui |
638
+ | `draw_fps` | `bool` | `True` | Overlay FPS counter on frame |
639
+ | `fps` | `int` | `15` | Recording frame rate (auto-recording only) |
640
+ | `mouse_callback` | `Callable` | `None` | OpenCV mouse-event callback |
641
+ | `mouse_callback_params` | `dict` | `None` | Extra params passed to mouse callback |
642
+ | `enable_auto_recording` | `bool` | `False` | Record every frame automatically from start |
643
+ | `enable_manual_recording` | `bool` | `False` | Allow toggling recording with `R` key |
644
+ | `record_format` | `str` | `"mp4"` | `"mp4"` or `"gif"` |
645
+ | `enable_screenshot` | `bool` | `False` | Enable `s`-key and auto-screenshot |
646
+ | `screenshot_output_dir` | `str` | `"screenshots"` | Directory for saved screenshots |
647
+ | `screenshot_prefix` | `str` | `"capture"` | Filename prefix before timestamp |
648
+ | `auto_screenshot_after_seconds` | `float` | `None` | Trigger first screenshot after N seconds |
649
+ | `auto_screenshot_repeat` | `bool` | `False` | Repeat auto-screenshot every N seconds |
650
+
651
+ ---
652
+
653
+ ### image_template
654
+
655
+ A single-image equivalent of `video_capture_template`. Loads one image from disk, applies an optional processing callback, resizes to the target resolution, auto-centers the window on screen, and displays it.
656
+
657
+ ```python
658
+ import cv2
659
+ from openvisionkit.capture.image_template import image_template
660
+ from openvisionkit.lib.face_detector import FaceDetector
661
+
662
+ detector = FaceDetector(model_path="./models/face_detector.tflite", running_mode="IMAGE")
663
+
664
+ def process(frame):
665
+     annotated, _ = detector.detect_faces(frame)
665
+     return annotated
667
+
668
+ image_template(
669
+ image_path="photo.jpg",
670
+ custom_logic=process, # receives the loaded BGR image, must return BGR image
671
+ window_name="Face Demo",
672
+ resolution=(1280, 720), # image is resized to this before display
673
+ center_window=True, # auto-centers window on screen via pyautogui
674
+ show_window=True, # set False to run headless (e.g. save to disk instead)
675
+ )
676
+ ```
677
+
678
+ Without a `custom_logic` callback the image is loaded, resized, and displayed as-is:
679
+
680
+ ```python
681
+ image_template(image_path="photo.jpg")
682
+ ```
683
+
684
+ **Parameter reference:**
685
+
686
+ | Parameter | Type | Default | Description |
687
+ |---|---|---|---|
688
+ | `image_path` | `str` | required | Path to the image file |
689
+ | `custom_logic` | `Callable[[ndarray], ndarray]` | `None` | Processing function applied before display |
690
+ | `window_name` | `str` | `"Demo"` | OpenCV window title |
691
+ | `resolution` | `tuple[int, int]` | `(1280, 720)` | `(width, height)` to resize the image |
692
+ | `center_window` | `bool` | `True` | Move window to screen center via pyautogui |
693
+ | `show_window` | `bool` | `True` | Display the OpenCV window |
694
+
695
+ ---
696
+
697
+ ### TextDetector
698
+
699
+ Tesseract-backed OCR class with per-character, per-word, and per-digit detection, document boundary detection, table extraction, image-to-image feature matching, cursive/handwriting OCR, and optional NLP post-processing via spaCy.
700
+
701
+ #### Installation prerequisites
702
+
703
+ See [TextDetector additional requirements](#textdetector-additional-requirements) above before using this class.
704
+
705
+ #### Basic OCR
706
+
707
+ ```python
708
+ import cv2
709
+ from openvisionkit.lib.text_detector import TextDetector
710
+
711
+ image = cv2.imread("document.jpg")
712
+
713
+ detector = TextDetector(
714
+ image=image,
715
+ lang="eng", # Tesseract language code(s); multi-language: "eng+chi_sim"
716
+ oem=3, # OCR Engine Mode — 3 = default (LSTM preferred)
717
+ psm=6, # Page Segmentation Mode — 6 = single uniform text block
718
+ preprocess=True, # apply grayscale + histogram equalization + adaptive threshold
719
+ use_gpu=False, # enable OpenCL GPU acceleration for OpenCV ops
720
+ )
721
+
722
+ # Full text string from the image
723
+ text = detector.detect_text()
724
+ print(text)
725
+
726
+ # Switch language at runtime (no need to reinstantiate)
727
+ detector.set_language("eng+fra")
728
+
729
+ # Replace the image on an existing instance
730
+ new_image = cv2.imread("page2.jpg")
731
+ detector.set_image(new_image)
732
+ ```
733
+
734
+ #### Word-level detection
735
+
736
+ ```python
737
+ words, annotated = detector.detect_words(
738
+ draw_boxes=True,
739
+ bounding_box_color=(255, 0, 0), # BGR
740
+ text_color=(255, 0, 0),
741
+ font_scale=1,
742
+ font_thickness=2,
743
+ )
744
+
745
+ for word in words:
746
+     print(word["text"])  # recognized word string
746
+     print(word["conf"])  # Tesseract confidence 0–100
747
+     print(word["x"], word["y"], word["w"], word["h"])  # bounding box
749
+
750
+ cv2.imshow("Words", annotated)
751
+ cv2.waitKey(0)
752
+
753
+ # Convenience accessors
754
+ word_strings = detector.get_words() # List[str]
755
+ lines = detector.get_lines() # List[str] — full lines
756
+ avg_conf = detector.get_confidence() # float — mean confidence across all words
757
+ df = detector.to_dataframe() # pandas DataFrame of word detections
758
+ ```
759
+
760
+ #### Character-level detection
761
+
762
+ ```python
763
+ chars, annotated = detector.detect_characters(
764
+ draw_boxes=True,
765
+ is_dark_background=False, # set True to invert image before OCR
766
+ adjust_text_height=20, # vertical offset for label above bounding box
767
+ bounding_box_color=(255, 0, 0),
768
+ text_color=(255, 0, 0),
769
+ )
770
+
771
+ for c in chars:
772
+     print(c["char"])  # single character string
772
+     print(c["x1"], c["y1"])  # top-left (OpenCV coords)
773
+     print(c["x2"], c["y2"])  # bottom-right (OpenCV coords)
775
+ ```
776
+
777
+ #### Digit-only detection
778
+
779
+ ```python
780
+ digits, annotated = detector.detect_digits(image, draw_boxes=True)
781
+ print(digits) # e.g. ['4', '2', '0']
782
+ ```
783
+
784
+ #### Document & table detection
785
+
786
+ ```python
787
+ # Detect document boundary (returns 4-corner numpy array, or None)
788
+ corners = detector.detect_document()
789
+ if corners is not None:
790
+     print("Document corners:", corners)
791
+
792
+ # Extract text from table regions using morphological line detection
793
+ tables = detector.detect_tables()
794
+ for table_text in tables:
795
+     print(table_text)
796
+ ```
797
+
798
+ #### Orientation & script detection
799
+
800
+ ```python
801
+ osd = detector.image_to_osd()
802
+ print(osd["Orientation in degrees"]) # e.g. '90'
803
+ print(osd["Script"]) # e.g. 'Latin'
804
+ ```
805
+
806
+ #### Export formats
807
+
808
+ ```python
809
+ # PDF bytes
810
+ pdf_bytes = detector.image_to_pdf_or_hocr(extension="pdf")
811
+ with open("output.pdf", "wb") as f:
812
+     f.write(pdf_bytes)
813
+
814
+ # hOCR HTML bytes
815
+ hocr_bytes = detector.image_to_pdf_or_hocr(extension="hocr")
816
+
817
+ # ALTO XML string (structured layout format for digital libraries)
818
+ alto_xml = detector.image_to_alto_xml()
819
+ ```
820
+
821
+ #### Handwriting / cursive OCR
822
+
823
+ ```python
824
+ text, preprocessed = detector.extract_cursive_text(image)
825
+ print(text)
826
+ # preprocessed is the adaptive-threshold binary image used for OCR
827
+ ```
828
+
829
+ #### Image preprocessing utilities
830
+
831
+ ```python
832
+ # Resize (uses imutils to preserve aspect ratio)
833
+ resized = detector.resize(width=800)
834
+
835
+ # Rotate (may clip corners)
836
+ rotated = detector.rotate(angle=45)
837
+
838
+ # Rotate without clipping
839
+ rotated_bound = detector.rotate_bound(angle=45)
840
+
841
+ # Auto deskew (corrects small rotation from skewed scans)
842
+ deskewed = detector.deskew()
843
+
844
+ # Auto Canny edge detection with sigma-based threshold
845
+ edges = detector.auto_canny(sigma=0.33)
846
+ ```
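A widely used heuristic for "automatic" Canny thresholds takes the median pixel intensity and places the lower and upper thresholds a fraction `sigma` below and above it. The sketch below shows that heuristic in plain Python; whether `auto_canny` follows exactly this rule is an assumption, and `auto_canny_thresholds` is a hypothetical helper.

```python
from statistics import median


def auto_canny_thresholds(pixels, sigma=0.33):
    """Return (lower, upper) Canny thresholds from grayscale intensities.

    `pixels` is any flat sequence of 0-255 values, for example the
    flattened grayscale image.
    """
    v = median(pixels)
    lower = int(max(0, (1.0 - sigma) * v))
    upper = int(min(255, (1.0 + sigma) * v))
    return lower, upper


# A mid-gray image with sigma=0.5 gives thresholds at half and 1.5x the median.
lower, upper = auto_canny_thresholds([100, 100, 90, 110, 100], sigma=0.5)
```

The resulting pair can be passed straight to `cv2.Canny(gray, lower, upper)`. A smaller `sigma` narrows the band around the median, while a larger one widens it.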
847
+
848
+ #### ORB keypoint detection and image matching
849
+
850
+ These methods are useful for comparing a scanned form against a template to detect alignment, tampering, or form type.
851
+
852
+ ```python
853
+ # Detect ORB keypoints and descriptors
854
+ keypoints, descriptors, annotated = detector.detect_keypoints(
855
+ features=500,
856
+ draw_keypoints=True,
857
+ keypoint_color=(0, 255, 0),
858
+ )
859
+
860
+ # Compare two images using KNN feature matching + RANSAC homography
861
+ # Falls back to SSIM if not enough features are found
862
+ template = cv2.imread("template.jpg")
863
+ result = detector.compare_matches_knn_matcher(
864
+ image2=template,
865
+ form_name="Invoice",
866
+ no_of_feature=500,
867
+ matched_amount=50,
868
+ percentage_of_matches=20,
869
+ draw_matches=False,
870
+ draw_aligned=False,
871
+ )
872
+ print(result["matches"]) # number of good matches
873
+ print(result["homography"]) # 3x3 transformation matrix
874
+ # result["aligned_image"] # template warped to match the query
875
+ # result["matched_image"] # side-by-side match visualization
876
+
877
+ # Brute-force matcher variant (no ratio test, faster but less selective)
878
+ result_bf = detector.compare_matches_bf_matcher(image2=template, form_name="Invoice")
879
+
880
+ # SSIM-based fallback (used automatically, also callable directly)
881
+ ssim_result = TextDetector.fallback_ssim(image, template, "Invoice")
882
+ print(ssim_result["ssim_score"]) # structural similarity score; 1.0 means identical images
883
+ ```
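
The comment above notes that the brute-force variant skips the ratio test, which implies the KNN variant applies one. As a rough illustration of that idea, the sketch below runs a ratio test over binary descriptors using Hamming distance, the usual metric for ORB. The function names and the `ratio=0.75` default are assumptions for illustration and are not taken from the library.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def ratio_test_matches(query, train, ratio=0.75):
    """Keep a match only if the best candidate is clearly closer than the second best."""
    good = []
    for qi, q in enumerate(query):
        # Rank every train descriptor by distance to this query descriptor
        ranked = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        if len(ranked) < 2:
            continue  # the ratio test needs two neighbours
        (d1, ti1), (d2, _) = ranked[0], ranked[1]
        if d1 < ratio * d2:
            good.append((qi, ti1, d1))
    return good


# Hypothetical 2-byte descriptors
matches = ratio_test_matches([b"\x00\x00"], [b"\x01\x00", b"\xff\xff"])
print(matches)
```

Ambiguous matches, where the two nearest neighbours are almost equally close, are dropped. That is why a ratio-tested matcher tends to be more selective than a plain brute-force one.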
884
+
885
+ #### NLP methods (requires spaCy `en_core_web_sm`)
886
+
887
+ ```python
888
+ raw_text = detector.detect_text()
889
+
890
+ # Clean whitespace and newlines
891
+ clean = detector.clean_text(raw_text)
892
+
893
+ # Named entity recognition — returns list of {text, label} dicts
894
+ entities = detector.extract_entities(raw_text)
895
+ # e.g. [{"text": "Singapore", "label": "GPE"}, {"text": "2026", "label": "DATE"}]
896
+
897
+ # Group entities by label
898
+ grouped = detector.group_entities(raw_text)
899
+ # e.g. {"GPE": ["Singapore"], "DATE": ["2026"]}
900
+
901
+ # Keyword extraction (nouns and proper nouns, stop-words filtered)
902
+ keywords = detector.extract_keywords(raw_text)
903
+
904
+ # Extractive summarization (top N sentences)
905
+ summary = detector.summarize(raw_text, max_sentences=3)
906
+
907
+ # Subject-verb-object relation extraction
908
+ relations = detector.extract_relations(raw_text)
909
+ # e.g. [{"subject": ["John"], "verb": "signed", "object": ["contract"]}]
910
+ ```
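
The relationship between the `extract_entities` and `group_entities` outputs shown above can be sketched in plain Python. This is only an illustration of the data shapes in the example comments; `group_by_label` is a hypothetical helper, and the library may deduplicate or order entries differently.

```python
from collections import defaultdict


def group_by_label(entities):
    """Turn a list of {"text", "label"} dicts into {label: [text, ...]}."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Entities in the shape shown in the README example
entities = [
    {"text": "Singapore", "label": "GPE"},
    {"text": "2026", "label": "DATE"},
]
print(group_by_label(entities))
```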
911
+
912
+ #### GPU acceleration
913
+
914
+ ```python
915
+ detector.enable_gpu() # enables OpenCV OpenCL (requires compatible GPU)
916
+ detector.disable_gpu() # revert to CPU
917
+ ```
918
+
919
+ ---
920
+
921
+ ## Project Structure
922
+
923
+ ```
924
+ openvisionkit/
925
+ ├── __init__.py # package version (__version__)
926
+ ├── lib/
927
+ │ ├── face_detector.py # FaceDetector
928
+ │ ├── face_mesh_detector.py # FaceMeshDetector (478 landmarks)
929
+ │ ├── hand_detector.py # HandDetector (21 landmarks)
930
+ │ ├── pose_detector.py # PoseDetector (33 landmarks)
931
+ │ ├── object_detector.py # ObjectDetector (EfficientDet)
932
+ │ ├── selfie_segmentation.py # SelfieSegmentation
933
+ │ ├── hair_segmentation.py # HairSegmentation
934
+ │ ├── fps_counter.py # FPSCounter utility
935
+ │ ├── classifier.py # Generic classifier
936
+ │ ├── form_detector.py # Form / document detector
937
+ │ ├── form_roi_detector.py # Form region-of-interest detector
938
+ │ ├── form_roi_annotator.py # Form annotation utilities
939
+ │ ├── image_detector.py # Image-based detector
940
+ │ ├── image_hsv_detector.py # HSV color-range detector
941
+ │ └── text_detector.py # Text detection
942
+ ├── capture/
943
+ │ ├── video_template.py # video_capture_template loop
944
+ │ ├── screen_capture.py # ScreenCapture
945
+ │ ├── video_recorder.py # VideoRecorder
946
+ │ ├── image_template.py # Single-image processing template
947
+ │ └── draw_object.py # Drawing helpers
948
+ └── utility/
949
+ ├── vision_utilis.py # Shared image utilities
950
+ └── live_plot.py # Real-time matplotlib plotting
951
+ ```
952
+
953
+ ---
954
+
955
+ ## Running Modes
956
+
957
+ All detectors support three MediaPipe running modes:
958
+
959
+ | Mode | Use case | Notes |
960
+ |---|---|---|
961
+ | `IMAGE` | Static images | No timestamp needed |
962
+ | `VIDEO` | Webcam / pre-recorded video | Pass `timestamp_ms` or let detector auto-increment |
963
+ | `LIVE_STREAM` | Async streaming | Results delivered via callback |
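
In `VIDEO` mode, MediaPipe expects frame timestamps to increase monotonically. The sketch below shows one way an auto-incrementing timestamp could work when `timestamp_ms` is omitted. The `TimestampSource` class and its fixed-FPS step are assumptions for illustration, not the detectors' actual implementation.

```python
class TimestampSource:
    """Hypothetical helper that yields strictly increasing millisecond timestamps."""

    def __init__(self, fps=30.0):
        self._step_ms = max(1, round(1000.0 / fps))  # nominal gap between frames
        self._last_ms = -1                           # nothing emitted yet

    def next(self, timestamp_ms=None):
        if timestamp_ms is None:
            # Auto-increment: start at 0, then advance by one frame interval
            timestamp_ms = 0 if self._last_ms < 0 else self._last_ms + self._step_ms
        if timestamp_ms <= self._last_ms:
            raise ValueError("timestamps must be strictly increasing")
        self._last_ms = timestamp_ms
        return timestamp_ms


source = TimestampSource(fps=25)
print(source.next(), source.next(), source.next(100), source.next())
```

Mixing explicit and automatic timestamps works as long as each value is larger than the previous one.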
964
+
965
+ ---
966
+
967
+ ## Contributing
968
+
969
+ ### Dev setup
970
+
971
+ ```bash
972
+ git clone https://github.com/your-org/openvisionkit.git
973
+ cd openvisionkit
974
+ make setup # uv sync + install pre-commit hooks
975
+ ```
976
+
977
+ ### Useful Make targets
978
+
979
+ | Target | What it does |
980
+ |---|---|
981
+ | `make setup` | Install all deps + pre-commit hooks (run once after clone) |
982
+ | `make format` | Auto-format with black + isort |
983
+ | `make lint` | Run ruff + flake8 |
984
+ | `make lint-fix` | Auto-fix ruff-fixable issues |
985
+ | `make test` | Run all non-integration tests |
986
+ | `make test-cov` | Run tests with HTML coverage report |
987
+ | `make typecheck` | mypy static analysis |
988
+ | `make check` | format-check + lint + typecheck (pre-push sanity) |
989
+
990
+ ### Commit convention
991
+
992
+ All commits must follow [Conventional Commits](https://www.conventionalcommits.org/). The pre-commit hook enforces this.
993
+
994
+ | Prefix | Effect |
995
+ |---|---|
996
+ | `fix:`, `perf:`, `refactor:` | patch release |
997
+ | `feat:` | minor release |
998
+ | `feat!:` or `BREAKING CHANGE:` footer | major release |
999
+ | `chore:`, `docs:`, `test:`, `ci:` | no release |
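
The prefix table above can be read as a small classification rule. The sketch below maps a commit message to its release effect. It is a simplified, hypothetical approximation of what a semantic-release tool does, and it assumes that any `!` marker or `BREAKING CHANGE:` footer means a major release, in line with the Conventional Commits specification.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": description"
_HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: .+")


def release_type(message):
    """Return "major", "minor", "patch", or "none" for a commit message."""
    header, _, body = message.partition("\n")
    m = _HEADER.match(header)
    if m is None:
        return "none"  # not a conventional commit; the hook would reject it anyway
    if m.group("bang") or re.search(r"^BREAKING CHANGE: ", body, re.MULTILINE):
        return "major"
    if m.group("type") == "feat":
        return "minor"
    if m.group("type") in {"fix", "perf", "refactor"}:
        return "patch"
    return "none"  # chore, docs, test, ci, and other non-releasing types


print(release_type("feat(ocr): add table export"))
```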
1000
+
1001
+ ### CI/CD
1002
+
1003
+ | Workflow | Trigger | Purpose |
1004
+ |---|---|---|
1005
+ | `ci-unit.yml` | push / PR | Unit tests on Python 3.11 + 3.12 |
1006
+ | `ci-integration.yml` | push/PR to main, manual | Integration tests (requires model files) |
1007
+ | `ci-security.yml` | push/PR to main, daily 02:00 UTC | pip-audit, Trivy, CodeQL |
1008
+ | `renovate.yml` | weekly Monday 01:00 UTC | Automated dependency updates |
1009
+ | `semantic-release.yml` | push to main | Semantic version bump + GitHub Release |
1010
+ | `publish.yml` | GitHub Release published | Build + publish to PyPI via OIDC |
1011
+
1012
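+ Releases are fully automated: push commits to `main` and the `semantic-release.yml` workflow handles version bumping, tagging, and changelog generation, then creates a GitHub Release. Publishing that release triggers `publish.yml`, which builds the package and publishes it to PyPI.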
1013
+
1014
+ ---
1015
+
1016
+ ## License
1017
+
1018
+ MIT