PyPI - ytcollector - Versions diffs - 1.0.8__py3-none-any.whl → 1.0.9__py3-none-any.whl - Mend

ytcollector 1.0.8__py3-none-any.whl → 1.0.9__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

ytcollector/__init__.py +36 -11
ytcollector/analyzer.py +205 -0
ytcollector/cli.py +186 -218
ytcollector/config.py +66 -62
ytcollector/dataset_builder.py +136 -0
ytcollector/downloader.py +328 -480
ytcollector-1.0.9.dist-info/METADATA +207 -0
ytcollector-1.0.9.dist-info/RECORD +11 -0
ytcollector-1.0.9.dist-info/entry_points.txt +4 -0
{ytcollector-1.0.8.dist-info → ytcollector-1.0.9.dist-info}/top_level.txt +0 -1
config/settings.py +0 -39
ytcollector/utils.py +0 -144
ytcollector/verifier.py +0 -187
ytcollector-1.0.8.dist-info/METADATA +0 -105
ytcollector-1.0.8.dist-info/RECORD +0 -12
ytcollector-1.0.8.dist-info/entry_points.txt +0 -2
{ytcollector-1.0.8.dist-info → ytcollector-1.0.9.dist-info}/WHEEL +0 -0

ytcollector-1.0.9.dist-info/METADATA ADDED

@@ -0,0 +1,207 @@
+Metadata-Version: 2.4
+Name: ytcollector
+Version: 1.0.9
+Summary: YouTube content collector - face, license plate, tattoo, and text detection
+Author: YTCollector Team
+License: MIT
+Project-URL: Homepage, https://github.com/yourusername/ytcollector
+Project-URL: Documentation, https://github.com/yourusername/ytcollector#readme
+Project-URL: Repository, https://github.com/yourusername/ytcollector
+Keywords: youtube,downloader,video-analysis,face-detection,ocr
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+Requires-Dist: yt-dlp>=2024.0.0
+Provides-Extra: analysis
+Requires-Dist: opencv-python>=4.5.0; extra == "analysis"
+Requires-Dist: easyocr>=1.6.0; extra == "analysis"
+Requires-Dist: numpy>=1.20.0; extra == "analysis"
+Provides-Extra: dev
+Requires-Dist: pytest>=7.0.0; extra == "dev"
+Requires-Dist: black>=23.0.0; extra == "dev"
+Requires-Dist: ruff>=0.1.0; extra == "dev"
+Provides-Extra: all
+Requires-Dist: ytcollector[analysis,dev]; extra == "all"
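+# YouTube Content Collector
+A CLI tool that automatically searches YouTube for videos in specific categories (face, license plate, tattoo, text), then downloads, analyzes, and collects them.
+## Installation
+### Required packages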
+```bash
+pip install yt-dlp
+```
+### Packages for analysis features (recommended)
+```bash
+pip install opencv-python easyocr numpy
+```
+## Usage
+### Basic run
+```bash
+python main.py
+```
+Defaults: 5 videos in the face category, up to 3 minutes each
+### Options
+| Option | Description | Default |
+|------|------|--------|
+| `-c`, `--categories` | Categories to collect | `face` |
+| `-n`, `--count` | Downloads per category | `5` |
+| `-d`, `--duration` | Maximum video length (minutes) | `3` |
+| `-o`, `--output` | Output path | `~/Downloads/youtube_collection` |
+| `--fast` | Fast mode (parallel downloads) | Disabled |
+| `-w`, `--workers` | Number of parallel downloads | `3` |
+| `--proxy` | Proxy address | None |
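+### Categories
+| Category | Description | Search sources |
+|----------|------|-----------|
+| `face` | Faces/people | SBS interviews, Running Man, My Little Old Boy, etc. |
+| `license_plate` | Vehicle license plates | Used-car listings, car-wash videos, new-car deliveries, etc. |
+| `tattoo` | Tattoos | Tattoo sessions, tattoo artists' work videos |
+| `text` | Text/subtitles | SBS variety shows (Running Man, Alley Restaurant, etc.) |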
+## Examples
+### Single category
+```bash
+# Collect 10 face videos
+python main.py -c face -n 10
+# Collect license plate videos (up to 5 minutes)
+python main.py -c license_plate -d 5
+# Collect tattoo videos
+python main.py -c tattoo -n 5
+```
+### Multiple categories
+```bash
+# 10 each of face and text
+python main.py -c face text -n 10
+# Collect all categories
+python main.py -c face license_plate tattoo text -n 5
+```
+### Fast mode
+```bash
+# Parallel downloads (3 at a time by default)
+python main.py -c face -n 10 --fast
+# 5 simultaneous downloads
+python main.py -c face -n 10 --fast -w 5
+```
+### Specifying the output path
+```bash
+python main.py -c face -o /path/to/save
+```
+### Using a proxy
+```bash
+python main.py -c face --proxy http://proxy.server:8080
+```
+## Building the SBS Dataset (from a URL list)
+Collects videos from a URL list and automatically clips each one (under 3 minutes) around a specified timestamp.
+### How to run
+```bash
+ytc-dataset youtube_url.txt
+```
+### youtube_url.txt format
+Write each line as `URL, MM:SS, TaskName`.
+```text
+https://www.youtube.com/watch?v=aqz-KE-bpKQ, 00:10, sample_task
+```
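+### Detailed options
+| Option | Description | Default |
+|------|------|--------|
+| `file` | Path to the URL list file | (required) |
+| `-o`, `--output` | Root output path | `.` |
+### Features
+- **Automatic trimming**: Cuts $\pm$ 1.5 minutes (3 minutes total) around the specified MM:SS timestamp.
+- **Duplicate prevention**: Skips videos that have already been downloaded or clipped, based on an index.
+- **Output layout**: Creates `./video/` (originals) and `./video_clips/` (clips) folders.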
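+## Output folder structure
+```
+<output path>/
+├── 얼굴/              # videos with faces detected
+├── 번호판/            # videos with license plates detected
+├── 번호판_미감지/     # no license plate detected (for manual review)
+├── 타투/              # videos with tattoos detected
+├── 텍스트/            # videos with text detected
+└── .archive.txt       # download log (prevents duplicates)
+```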
+## File structure
+```
+260202_test/
+├── main.py        # CLI entry point
+├── config.py      # settings (search queries, user agents, etc.)
+├── analyzer.py    # video analysis (OpenCV, EasyOCR)
+├── downloader.py  # download logic
+└── README.md      # user guide
+```
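+## Analysis features
+| Detection target | Technology | Description |
+|-----------|-----------|------|
+| Face | OpenCV Haar Cascade | Frontal face detection |
+| Text | EasyOCR | Korean/English character recognition |
+| License plate | EasyOCR + regular expressions | License plate pattern matching |
+| Tattoo | OpenCV HSV analysis | Ink patterns within skin regions |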
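+## Notes
+- Videos are analyzed after download and kept only if the target category is detected
+- For the license plate category, videos with no detection are also kept in a separate folder (for manual review)
+- Videos that were already downloaded are skipped automatically (recorded in `.archive.txt`)
+- Private, deleted, and copyright-restricted videos are skipped automatically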
+## Fast mode vs. normal mode
+| Item | Normal mode | Fast mode |
+|------|-----------|-----------|
+| Downloads | Sequential | Parallel |
+| Delay | 0.5-1.5 s | None |
+| Retries | 3 | 1 |
+| Timeout | 30 s | 10 s |
+Fast mode is quicker but may increase the risk of being blocked by YouTube.
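
The mode comparison table at the end of the README could be expressed as a small options builder. This is a minimal sketch, not the code in `ytcollector/downloader.py` (whose body is not shown in this diff). `build_ydl_opts` is a hypothetical name, and mapping the table rows onto yt-dlp's `retries`, `socket_timeout`, `sleep_interval`, `max_sleep_interval`, and `proxy` options is an assumption about how the package wires them.

```python
from typing import Any, Dict, Optional


def build_ydl_opts(fast: bool, proxy: Optional[str] = None) -> Dict[str, Any]:
    """Translate the README's mode table into yt-dlp option names (assumed mapping)."""
    opts: Dict[str, Any] = {
        "retries": 1 if fast else 3,            # "Retries" row
        "socket_timeout": 10 if fast else 30,   # "Timeout" row, in seconds
    }
    if not fast:
        # Normal mode: yt-dlp sleeps a random time between these bounds
        # before each download ("Delay" row). Fast mode sets no delay.
        opts["sleep_interval"] = 0.5
        opts["max_sleep_interval"] = 1.5
    if proxy:
        opts["proxy"] = proxy
    return opts


normal_opts = build_ydl_opts(fast=False)
fast_opts = build_ydl_opts(fast=True, proxy="http://proxy.server:8080")
```

Under these assumptions the resulting dictionary could be passed to `yt_dlp.YoutubeDL(...)`; parallelism (the "Downloads" row) would be handled separately, for example with a worker pool sized by `--workers`.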

ytcollector-1.0.9.dist-info/RECORD ADDED

@@ -0,0 +1,11 @@
+ytcollector/__init__.py,sha256=OkibE8GYgt1qwOmkiBNXywkGVdnMj5sVpVzDVPSRXQg,1094
+ytcollector/analyzer.py,sha256=JvppXAcoZ43lXJnGRX-dVGTSZ0QQ-IxBzF6ljT1BjJQ,6388
+ytcollector/cli.py,sha256=zOwnHs7kClOkcWHSUPXrVIPaZYKADMNCBsIosZEzmYc,5629
+ytcollector/config.py,sha256=w5Sx-jKdp4R-rCncDdOXc3WfSuH5OXkVRMIeMXL48VU,2216
+ytcollector/dataset_builder.py,sha256=HGVX_mR1W7_wBl2C5C6Cj43OCVseAGIYmg3-n8WLKuo,4598
+ytcollector/downloader.py,sha256=yQGGTR9ErjHlXHp_RXIDD3Zbl9geTyTHGROPO0nuxV8,12794
+ytcollector-1.0.9.dist-info/METADATA,sha256=bIEbwbhupi-Eo6HQ_4KCPRsM_09d6QK6HAnq2aMiNdM,6212
+ytcollector-1.0.9.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
+ytcollector-1.0.9.dist-info/entry_points.txt,sha256=waiVuSJJYt-6_DAal-T4JkHgejo7wKYLdKrEI7tZ-ms,127
+ytcollector-1.0.9.dist-info/top_level.txt,sha256=wozNyCUm0eMOm-9U81yTql6oGaM2O5rWVBXDb93zzyQ,12
+ytcollector-1.0.9.dist-info/RECORD,,
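
Each RECORD row has the form `path,sha256=<digest>,size`, where the digest is the URL-safe base64 encoding of the SHA-256 hash with trailing `=` padding removed. The sketch below shows how such a row could be parsed and how the hash string could be recomputed for a file's bytes. The helper names `parse_record_line` and `record_hash` are hypothetical.

```python
import base64
import csv
import hashlib
from typing import Optional, Tuple


def record_hash(data: bytes) -> str:
    """Return the RECORD-style hash string for the given file contents."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record_line(line: str) -> Tuple[str, str, Optional[int]]:
    """Split one RECORD row into (path, hash_spec, size); size may be absent."""
    path, hash_spec, size = next(csv.reader([line]))
    return path, hash_spec, int(size) if size else None


row = parse_record_line(
    "ytcollector-1.0.9.dist-info/top_level.txt,"
    "sha256=wozNyCUm0eMOm-9U81yTql6oGaM2O5rWVBXDb93zzyQ,12"
)
self_row = parse_record_line("ytcollector-1.0.9.dist-info/RECORD,,")
```

To check an installed file, compare `record_hash(path.read_bytes())` and the byte length against the parsed row. The RECORD file's own row carries no hash or size, which is why `self_row` ends with an empty hash and `None`.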

ytcollector-1.0.9.dist-info/entry_points.txt ADDED

@@ -0,0 +1,4 @@
+[console_scripts]
+ytc = ytcollector.cli:main
+ytc-dataset = ytcollector.dataset_builder:main
+ytcollector = ytcollector.cli:main
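
The `[console_scripts]` section maps each command name to a `module:function` target, so `ytc` and `ytcollector` both call `ytcollector.cli:main`, while `ytc-dataset` calls `ytcollector.dataset_builder:main`. As an illustration, the file can be read with the standard-library `configparser`; the `console_scripts` helper below is a hypothetical name, not part of the package.

```python
import configparser
from typing import Dict, Tuple

ENTRY_POINTS_TXT = """\
[console_scripts]
ytc = ytcollector.cli:main
ytc-dataset = ytcollector.dataset_builder:main
ytcollector = ytcollector.cli:main
"""


def console_scripts(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each command name to its (module, function) target."""
    # Only "=" separates name and target; ":" belongs to the target itself.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep command names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser["console_scripts"].items():
        module, _, function = target.partition(":")
        scripts[name] = (module, function)
    return scripts


scripts = console_scripts(ENTRY_POINTS_TXT)
```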

{ytcollector-1.0.8.dist-info → ytcollector-1.0.9.dist-info}/top_level.txt RENAMED

@@ -1,2 +1 @@
-config
 ytcollector

config/settings.py DELETED

@@ -1,39 +0,0 @@
-"""
-SBS Dataset Collection Pipeline - Settings
-"""
-from pathlib import Path
-# Base paths
-BASE_DIR = Path(__file__).parent.parent
-DATA_DIR = BASE_DIR / "data"
-URLS_DIR = DATA_DIR / "urls"
-VIDEOS_DIR = DATA_DIR / "videos"
-CLIPS_DIR = DATA_DIR / "clips"
-OUTPUTS_DIR = BASE_DIR / "outputs"
-REPORTS_DIR = OUTPUTS_DIR / "reports"
-# Video settings
-CLIP_DURATION_BEFORE = 90  # 1 min 30 s (in seconds)
-CLIP_DURATION_AFTER = 90   # 1 min 30 s (in seconds)
-MAX_CLIP_DURATION = 180    # at most 3 minutes
-# Download settings
-VIDEO_FORMAT = "best[ext=mp4]/best"
-DOWNLOAD_RETRIES = 3
-# YOLO-World settings
-YOLO_MODEL = "yolov8s-worldv2.pt"
-CONFIDENCE_THRESHOLD = 0.25
-FRAME_SAMPLE_RATE = 30  # sample every 30th frame (about 1 second)
-# Task-specific class prompts
-TASK_CLASSES = {
-    "face": ["human face", "person face", "close-up face"],
-    "license_plate": ["car license plate", "vehicle license plate", "korean license plate"],
-    "tattoo": ["tattoo", "body tattoo", "skin tattoo"],
-    "text": ["text on screen", "subtitle", "korean text", "caption"]
-}
-# Create directories if not exist
-for dir_path in [URLS_DIR, VIDEOS_DIR, CLIPS_DIR, REPORTS_DIR]:
-    dir_path.mkdir(parents=True, exist_ok=True)
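
The three clip constants in this removed settings module imply a window of 90 seconds on each side of a reference timestamp, capped at 180 seconds. The sketch below shows one way that window could be computed from an `MM:SS` string. The functions `parse_mmss` and `clip_window` are hypothetical, and clamping at zero and at the video's duration is an assumption rather than behavior confirmed by this diff.

```python
from typing import Optional, Tuple

# Values copied from the removed config/settings.py.
CLIP_DURATION_BEFORE = 90
CLIP_DURATION_AFTER = 90
MAX_CLIP_DURATION = 180


def parse_mmss(stamp: str) -> int:
    """Convert an 'MM:SS' string to total seconds."""
    minutes, seconds = stamp.strip().split(":")
    return int(minutes) * 60 + int(seconds)


def clip_window(center_sec: int, video_duration: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) in seconds around center_sec (clamping is assumed)."""
    start = max(0, center_sec - CLIP_DURATION_BEFORE)
    end = center_sec + CLIP_DURATION_AFTER
    if video_duration is not None:
        end = min(end, video_duration)
    # Never exceed the configured maximum clip length.
    end = min(end, start + MAX_CLIP_DURATION)
    return start, end


early = clip_window(parse_mmss("00:10"))
middle = clip_window(parse_mmss("05:00"))
near_end = clip_window(parse_mmss("05:00"), video_duration=320)
```

With this clamping, a timestamp near the start of a video yields a clip shorter than three minutes, since the window cannot begin before second 0.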

ytcollector/utils.py DELETED

@@ -1,144 +0,0 @@
-"""
-Utility functions for the SBS Dataset Collection Pipeline
-"""
-from pathlib import Path
-from datetime import datetime
-import re
-import json
-def timestamp_to_seconds(minutes: int, seconds: int) -> int:
-    """Convert minutes and seconds to total seconds"""
-    return minutes * 60 + seconds
-def seconds_to_timestamp(total_seconds: int) -> str:
-    """Convert seconds to MM:SS format"""
-    minutes = total_seconds // 60
-    seconds = total_seconds % 60
-    return f"{minutes:02d}:{seconds:02d}"
-def extract_video_id(url: str) -> str:
-    """Extract the video ID from a YouTube URL"""
-    patterns = [
-        r'(?:v=|/)([0-9A-Za-z_-]{11}).*',
-        r'(?:embed/)([0-9A-Za-z_-]{11})',
-        r'(?:youtu\.be/)([0-9A-Za-z_-]{11})',
-    ]
-    for pattern in patterns:
-        match = re.search(pattern, url)
-        if match:
-            return match.group(1)
-    return url[-11:] if len(url) >= 11 else url
-def ensure_dir(path: Path) -> Path:
-    """Create the directory if it does not exist"""
-    try:
-        path.mkdir(parents=True, exist_ok=True)
-    except PermissionError:
-        # e.g. permission problems on a network drive
-        pass
-    return path
-def get_output_dir(base_dir: Path) -> Path:
-    """Return the directory where videos are actually saved"""
-    from .config import CUSTOM_OUTPUT_DIR
-    if CUSTOM_OUTPUT_DIR:
-        return ensure_dir(Path(CUSTOM_OUTPUT_DIR))
-    # Default: video/ inside the project folder (single-folder mode)
-    # Previously video/{task_type}; changed to save everything "in one folder" after a requirements change
-    return ensure_dir(base_dir / "video")
-def get_next_filename(output_dir: Path, task_type: str) -> str:
-    """
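-    Generate a sequential filename (task_0001.mp4)
-    Scans existing files in the folder and returns the highest number + 1
-    """
-    # glob can be slow, so this needs optimizing if the file count grows
-    # Fine for now because of the 100-file limit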
-    existing_files = list(output_dir.glob(f"{task_type}_*.mp4"))
-    max_num = 0
-    pattern = re.compile(rf"{task_type}_(\d{{4}})\.mp4")
-    for file_path in existing_files:
-        match = pattern.match(file_path.name)
-        if match:
-            num = int(match.group(1))
-            if num > max_num:
-                max_num = num
-    next_num = max_num + 1
-    return f"{task_type}_{next_num:04d}"
-def get_clip_path(base_dir: Path, task_type: str, filename: str = None) -> Path:
-    """Return the clip save path"""
-    # Generate a sequential name if filename is None
-    output_dir = get_output_dir(base_dir)
-    if filename is None:
-        filename_str = get_next_filename(output_dir, task_type)
-        return output_dir / f"{filename_str}.mp4"
-    # Ensure the .mp4 extension
-    if not filename.endswith('.mp4'):
-        filename += '.mp4'
-    return output_dir / filename
-def get_task_video_count(base_dir: Path, task_type: str) -> int:
-    """Count the videos for this task (by filename)"""
-    output_dir = get_output_dir(base_dir)
-    return len(list(output_dir.glob(f"{task_type}_*.mp4")))
-def load_history(base_dir: Path) -> dict:
-    """Load the download history (used to prevent duplicate URLs)"""
-    # The history file is always stored in the local project folder (not on a network share)
-    history_path = base_dir / "download_history.json"
-    if history_path.exists():
-        try:
-            return json.loads(history_path.read_text(encoding='utf-8'))
-        except:
-            return {}
-    return {}
-def save_history(base_dir: Path, history: dict):
-    """Save the download history"""
-    history_path = base_dir / "download_history.json"
-    history_path.write_text(json.dumps(history, indent=2, ensure_ascii=False), encoding='utf-8')
-def get_url_file_path(base_dir: Path, task_type: str) -> Path:
-    """Return the URL file path"""
-    # The URL file lives at local urls/task_type/youtube_url.txt
-    task_dir = ensure_dir(base_dir / "urls" / task_type)
-    return task_dir / "youtube_url.txt"
-def get_report_path(base_dir: Path, task_type: str, filename: str) -> Path:
-    """Return the report save path"""
-    task_dir = ensure_dir(base_dir / "outputs" / "reports" / task_type)
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    return task_dir / f"{filename}_report_{timestamp}.json"
-def validate_url(url: str) -> bool:
-    """Validate a YouTube URL"""
-    youtube_patterns = [
-        r'(https?://)?(www\.)?youtube\.com/watch\?v=',
-        r'(https?://)?(www\.)?youtu\.be/',
-        r'(https?://)?(www\.)?youtube\.com/embed/',
-    ]
-    return any(re.match(pattern, url) for pattern in youtube_patterns)
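
The removed `extract_video_id` always returns a string. Its first pattern accepts any run of 11 ID characters after a `/`, and the fallback returns the last 11 characters of whatever was passed in, so a non-video URL still produces a value. A stricter alternative, sketched here with `urllib.parse`, returns `None` when no ID is found. `extract_video_id_strict` and the accepted host list are assumptions for illustration and are not part of either package version.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

_ID_RE = re.compile(r"[0-9A-Za-z_-]{11}")
# Hosts accepted by this sketch (an assumption, not an exhaustive list).
_YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
_SHORT_HOSTS = {"youtu.be", "www.youtu.be"}


def extract_video_id_strict(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if the URL is not recognized."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = (parsed.hostname or "").lower()
    candidate = None
    if host in _SHORT_HOSTS:
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith("/embed/"):
            candidate = parsed.path.split("/")[2]
    if candidate and _ID_RE.fullmatch(candidate):
        return candidate
    return None


watch_id = extract_video_id_strict("https://www.youtube.com/watch?v=aqz-KE-bpKQ")
short_id = extract_video_id_strict("youtu.be/aqz-KE-bpKQ")
other = extract_video_id_strict("https://example.com/some-long-path")
```

Returning `None` lets the caller skip or log unrecognized lines instead of passing an arbitrary 11-character string downstream.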

ytcollector/verifier.py DELETED

@@ -1,187 +0,0 @@
-"""
-YOLO-World Verifier Module
-YOLO-World-based object detection and class verification
-"""
-import json
-from pathlib import Path
-from typing import List, Dict, Optional
-import logging
-from datetime import datetime
-import cv2
-from tqdm import tqdm
-# Updated imports for new package structure
-from .config import (
-    YOLO_MODEL,
-    CONFIDENCE_THRESHOLD,
-    FRAME_SAMPLE_RATE,
-    TASK_CLASSES,
-)
-from .utils import get_report_path
-logger = logging.getLogger(__name__)
-class YOLOWorldVerifier:
-    """Video verification class based on YOLO-World"""
-    def __init__(self, task_type: str, base_dir: Path = None, model_name: str = YOLO_MODEL):
-        self.task_type = task_type
-        self.base_dir = base_dir or Path.cwd()
-        self.model_name = model_name
-        self.model = None
-        self.classes = TASK_CLASSES.get(task_type, [])
-        if not self.classes:
-            raise ValueError(f"Unknown task type: {task_type}")
-    def load_model(self):
-        """Load the YOLO-World model"""
-        if self.model is None:
-            from ultralytics import YOLOWorld
-            logger.info(f"Loading YOLO-World model: {self.model_name}")
-            self.model = YOLOWorld(self.model_name)
-            logger.info(f"Setting classes for {self.task_type}: {self.classes}")
-            self.model.set_classes(self.classes)
-        return self.model
-    def verify_frame(self, frame) -> List[Dict]:
-        """Detect objects in a single frame"""
-        model = self.load_model()
-        results = model.predict(frame, conf=CONFIDENCE_THRESHOLD, verbose=False)
-        detections = []
-        for result in results:
-            boxes = result.boxes
-            for box in boxes:
-                detection = {
-                    'class_id': int(box.cls[0]),
-                    'class_name': self.classes[int(box.cls[0])] if int(box.cls[0]) < len(self.classes) else 'unknown',
-                    'confidence': float(box.conf[0]),
-                    'bbox': box.xyxy[0].tolist(),
-                }
-                detections.append(detection)
-        return detections
-    def verify_video(
-        self,
-        video_path: Path,
-        sample_rate: int = FRAME_SAMPLE_RATE
-    ) -> Dict:
-        """Verify an entire video"""
-        logger.info(f"Verifying video: {video_path}")
-        cap = cv2.VideoCapture(str(video_path))
-        if not cap.isOpened():
-            raise ValueError(f"Cannot open video: {video_path}")
-        fps = cap.get(cv2.CAP_PROP_FPS)
-        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
-        duration = total_frames / fps if fps > 0 else 0
-        frame_results = []
-        detection_count = 0
-        frames_with_detection = 0
-        frame_idx = 0
-        pbar = tqdm(total=total_frames // sample_rate, desc="Verifying")
-        while True:
-            ret, frame = cap.read()
-            if not ret:
-                break
-            if frame_idx % sample_rate == 0:
-                detections = self.verify_frame(frame)
-                if detections:
-                    frames_with_detection += 1
-                    detection_count += len(detections)
-                    frame_results.append({
-                        'frame_idx': frame_idx,
-                        'timestamp_sec': frame_idx / fps if fps > 0 else 0,
-                        'detections': detections,
-                    })
-                pbar.update(1)
-            frame_idx += 1
-        cap.release()
-        pbar.close()
-        sampled_frames = max(1, total_frames // sample_rate)
-        detection_rate = frames_with_detection / sampled_frames
-        result = {
-            'video_path': str(video_path),
-            'task_type': self.task_type,
-            'classes': self.classes,
-            'summary': {
-                'total_frames': total_frames,
-                'sampled_frames': sampled_frames,
-                'fps': fps,
-                'duration_sec': duration,
-                'frames_with_detection': frames_with_detection,
-                'total_detections': detection_count,
-                'detection_rate': detection_rate,
-            },
-            'frame_results': frame_results,
-            'verified_at': datetime.now().isoformat(),
-            'model': self.model_name,
-            'is_valid': detection_rate > 0.01,  # considered valid at a detection rate of 1% or more (lowered from the previous 10%)
-        }
-        logger.info(
-            f"Verification complete: {frames_with_detection}/{sampled_frames} frames "
-            f"({detection_rate:.1%}) with {self.task_type} detected"
-        )
-        return result
-    def save_report(self, result: Dict, output_path: Optional[Path] = None) -> Path:
-        """Save the verification result as JSON"""
-        if output_path is None:
-            video_name = Path(result['video_path']).stem
-            output_path = get_report_path(self.base_dir, self.task_type, video_name)
-        with open(output_path, 'w', encoding='utf-8') as f:
-            json.dump(result, f, ensure_ascii=False, indent=2)
-        logger.info(f"Report saved to: {output_path}")
-        return output_path
-def verify_clip(video_path: Path, task_type: str, base_dir: Path = None) -> Dict:
-    """Helper function to verify a clip"""
-    verifier = YOLOWorldVerifier(task_type, base_dir)
-    result = verifier.verify_video(video_path)
-    verifier.save_report(result)
-    return result
-def batch_verify(video_dir: Path, task_type: str, base_dir: Path = None) -> List[Dict]:
-    """Verify every video in a directory in batch"""
-    verifier = YOLOWorldVerifier(task_type, base_dir)
-    results = []
-    video_files = list(video_dir.glob("*.mp4"))
-    logger.info(f"Found {len(video_files)} videos to verify")
-    for video_path in video_files:
-        try:
-            result = verifier.verify_video(video_path)
-            verifier.save_report(result)
-            results.append(result)
-        except Exception as e:
-            logger.error(f"Failed to verify {video_path}: {e}")
-            results.append({'video_path': str(video_path), 'error': str(e), 'is_valid': False})
-    return results
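
The validity rule in the removed `verify_video` comes down to sampling every Nth frame, counting sampled frames with at least one detection, and comparing the ratio against a 1% threshold. The sketch below reproduces that arithmetic without OpenCV or YOLO-World, so `detect` is a stand-in callable and `detection_summary` is a hypothetical name. One deliberate difference: it counts the frames actually sampled, whereas the removed code estimated the denominator as `total_frames // sample_rate`, which can undercount by one when the frame count is not a multiple of the sample rate.

```python
from typing import Any, Callable, Dict, Iterable


def detection_summary(
    frames: Iterable[Any],
    detect: Callable[[Any], bool],
    sample_rate: int = 30,
    valid_threshold: float = 0.01,
) -> Dict[str, Any]:
    """Sample every sample_rate-th frame and summarize how many had detections."""
    sampled = 0
    hits = 0
    for frame_idx, frame in enumerate(frames):
        if frame_idx % sample_rate != 0:
            continue
        sampled += 1
        if detect(frame):
            hits += 1
    rate = hits / sampled if sampled else 0.0
    return {
        "sampled_frames": sampled,
        "frames_with_detection": hits,
        "detection_rate": rate,
        "is_valid": rate > valid_threshold,
    }


# Stand-in data: 100 "frames" where a detection fires from frame 60 onward.
summary = detection_summary(range(100), detect=lambda frame: frame >= 60)
empty = detection_summary(range(100), detect=lambda frame: False)
```

Here frames 0, 30, 60, and 90 are sampled, so four frames are counted where `100 // 30` would give three.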
