ureca_document_parser 0.0.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (48)

ureca_document_parser-0.0.1/.github/workflows/ci.yml ADDED

@@ -0,0 +1,53 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Set up Python ${{ matrix.python-version }}
+        run: uv python install ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: uv sync --extra dev
+      - name: Run tests
+        run: uv run pytest tests/ -v --tb=short
+      - name: Run tests with coverage
+        run: uv run pytest tests/ --cov=ureca_document_parser --cov-report=term-missing
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Install dependencies
+        run: uv sync --extra dev
+      - name: Run ruff check
+        run: uv run ruff check src/
+      - name: Run ruff format check
+        run: uv run ruff format --check src/

ureca_document_parser-0.0.1/.github/workflows/docs.yml ADDED

@@ -0,0 +1,27 @@
+name: Deploy Docs
+on:
+  push:
+    branches: [main]
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Install dependencies
+        run: uv sync --extra docs
+      - name: Build and deploy docs
+        run: uv run mkdocs gh-deploy --force

ureca_document_parser-0.0.1/.github/workflows/publish.yml ADDED

@@ -0,0 +1,36 @@
+name: Publish to PyPI
+on:
+  release:
+    types: [published]
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write  # OIDC for trusted publishing
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Install dependencies
+        run: uv sync
+      - name: Run tests
+        run: uv run pytest tests/ -v
+      - name: Build package
+        run: uv build
+      - name: Publish to PyPI
+        run: uv publish
+        env:
+          UV_PUBLISH_TOKEN: ${{ secrets.PYPI_API_TOKEN }}

ureca_document_parser-0.0.1/.gitignore ADDED

@@ -0,0 +1,10 @@
+__pycache__
+.venv
+*.log
+*.lock
+dist
+*.log
+.ruff_cache/
+.pytest_cache/
+site/

ureca_document_parser-0.0.1/.python-version ADDED

@@ -0,0 +1 @@
+3.12

ureca_document_parser-0.0.1/CLAUDE.md ADDED

@@ -0,0 +1,143 @@
+# CLAUDE.md
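+## Project Overview
+`ureca_document_parser` is a multi-format document parser that converts HWP/HWPX files from the Korean word processor Hangul (아래한글) into Markdown or LangChain Document chunks. It is packaged for distribution on PyPI, and its clean-architecture design makes it easy to extend with new formats.
+## Commands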
+```bash
+uv sync
+uv run ureca_document_parser <file.hwp|file.hwpx> -o <output.md>
+uv run ureca_document_parser --list-formats
+uv run python -m ureca_document_parser <file.hwp> -o <output.md>
+uv run pytest tests/ -v
+uv build
+```
+## Architecture
+**Pipeline**: input file → format registry → parser → Document model → Writer → output (or → TextSplitter → LangChain Documents)
+```
+src/ureca_document_parser/
+├── __init__.py        # Public API (convert, convert_to_chunks, get_registry)
+├── __main__.py        # python -m ureca_document_parser
+├── cli.py             # CLI (argparse, registry-based automatic routing)
+├── models.py          # Document model (Paragraph, Table, Image, ListItem, ...)
+├── protocols.py       # Parser / Writer Protocol (structural subtyping)
+├── registry.py        # FormatRegistry (extension→parser and format name→Writer mapping, thread-safe singleton)
+├── styles.py          # Shared heading patterns
+├── hwp/
+│   ├── __init__.py    # Re-exports HwpParser and low-level types
+│   ├── parser.py      # HWP v5 binary parser (olefile): orchestration
+│   ├── records.py     # Binary record parsing (Record, RecordCursor, constants)
+│   ├── text.py        # Character scanning and text extraction (CharInfo, BSTR)
+│   └── tables.py      # Three-stage table extraction
+├── hwpx/
+│   ├── __init__.py    # Re-exports HwpxParser
+│   └── parser.py      # HWPX parser (zipfile + xml.etree)
+└── writers/
+    └── markdown.py    # Markdown writer
+```
+### Key Modules
+- **`protocols.py`**: the `Parser` / `Writer` Protocols. No inheritance is required; a class only needs to match the static method signatures.
+- **`registry.py`**: `FormatRegistry` maps extensions to parsers and format names to Writers. Access the thread-safe singleton through `get_registry()`.
+- **`models.py`**: the shared document model. `Document` = `list[DocumentElement]` + `Metadata`. Raises `ParseError` when parsing fails.
+- **`hwp/`**: the HWP v5 binary parser, split into `records.py` (record stream), `text.py` (character extraction), `tables.py` (table parsing), and `parser.py` (orchestration).
+- **`hwpx/parser.py`**: the HWPX (ZIP+XML) parser. Uses the standard library's `xml.etree`.
+- **`writers/markdown.py`**: converts a `Document` to Markdown. Groups consecutive `ListItem`s into a single block.
+### Library Usage Example
+```python
+from ureca_document_parser import convert
+convert("보고서.hwp", "보고서.md")
+from ureca_document_parser import convert_to_chunks
+chunks = convert_to_chunks("보고서.hwp", chunk_size=1000, chunk_overlap=200)
+```
+## Tests
+```bash
+uv run pytest tests/ -v              # Run all tests
+uv run pytest tests/ --cov           # Run with coverage
+```
+Test coverage: models, registry, CLI, HWP parser (unit + integration), HWPX parser, Markdown writer.
+## Extending Formats
+See `docs/reference/extending.md`. Write a parser/Writer class that conforms to the Protocol and register it in `registry.py:_auto_register()`.
+## Dependencies
+Required: `olefile`. Optional: `langchain-text-splitters`+`langchain-core` (chunk splitting), `pymupdf` (PDF), `pillow`+`pytesseract` (OCR). Development: `pytest`, `pytest-cov`, `mypy`, `ruff`.
+## Documentation
+### Structure
+```
+docs/
+├── index.md              # Home: quickstart, key features
+├── installation.md       # Installation (basic + optional dependencies)
+├── formats/              # Per-format guides
+│   ├── hwp.md           # HWP format (overview + file structure + usage examples)
+│   └── hwpx.md          # HWPX format (overview + file structure + usage examples)
+├── guides/               # Usage guides
+│   ├── cli.md           # CLI usage
+│   ├── python-api.md    # Basic Python API usage
+│   ├── langchain.md     # LangChain integration (RAG)
+│   └── advanced.md      # Advanced usage (working with the Document model directly)
+├── api-reference.md     # API reference (convert, convert_to_chunks, get_registry, etc.; auto-generated by mkdocstrings)
+└── reference/            # Technical reference (for contributors)
+    ├── architecture.md  # Internal architecture (pipeline, module dependencies, implementation details)
+    └── extending.md     # Guide to adding a new parser/Writer
+```
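+### Writing Rules
+- Tone: friendly, polite Korean in the style of the es-toolkit docs (`~예요`, `~해요`, `~돼요`)
+- Write all text in Korean except code and technical terms.
+- Perspective: write for **users who install the package in an external project**. Do not copy and paste internal source code.
+- CLI examples must always take the form `uv run ureca_document_parser ...`.
+- Base example filenames on real usage scenarios (e.g. `보고서.hwp`, `제안서.hwpx`)
+- When mentioning an external dependency, **link its name to the official documentation** and put a `uv add` code block directly below it.
+- `api-reference.md` is auto-generated from docstrings by `mkdocstrings`, so write only minimal descriptions.
+- Write documents under `docs/reference/` from the perspective of contributors or users who need a deeper understanding.
+- Mermaid diagrams are available (already configured in mkdocs.yml)
+- MkDocs admonitions are available: `!!! note`, `!!! info`, `!!! warning`
+### Build and Preview
+```bash
+uv sync --extra docs                    # Install documentation dependencies
+uv run mkdocs serve                     # Local preview at http://127.0.0.1:8000
+uv run mkdocs build                     # Build static files into the site/ directory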
+```
+### Deployment
+Deployment is automatic. Pushing to the `main` branch runs `.github/workflows/docs.yml`, which deploys to GitHub Pages.
+- Workflow: `mkdocs gh-deploy --force` → pushes to the `gh-pages` branch
+- Pages settings: Source = `gh-pages` branch (GitHub Settings → Pages)
+- URL: https://ureca-corp.github.io/document_parser/
+If a manual deployment is needed:
+```bash
+uv run mkdocs gh-deploy --force
+```
+### Navigation
+When you add or remove a page, update the `nav:` section of `mkdocs.yml` as well.
+## CI
+GitHub Actions (`.github/workflows/ci.yml`) runs on pushes and PRs to the `main` branch: a Python 3.12 + 3.13 test matrix plus ruff lint/format checks.
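
Note on the CLAUDE.md architecture section: the Protocol-plus-registry design it describes (structural subtyping, extension→parser mapping) might look roughly like the sketch below. This is an illustration under assumptions, not the package's code: `TxtParser`, the `extensions()` and `parse()` method names, `register_parser`, and the simplified `Document` are all hypothetical stand-ins, and the real signatures in `protocols.py` and `registry.py` may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class Document:
    """Simplified stand-in for the package's Document model (assumption)."""

    elements: list[str] = field(default_factory=list)


class Parser(Protocol):
    """Structural interface: any class with matching static methods qualifies."""

    @staticmethod
    def extensions() -> list[str]: ...

    @staticmethod
    def parse(path: Path) -> Document: ...


class TxtParser:
    """Hypothetical parser. Note that it does not inherit from Parser."""

    @staticmethod
    def extensions() -> list[str]:
        return [".txt"]

    @staticmethod
    def parse(path: Path) -> Document:
        # One element per line of the input file.
        return Document(elements=path.read_text(encoding="utf-8").splitlines())


class FormatRegistry:
    """Maps lowercase file extensions to parser classes."""

    def __init__(self) -> None:
        self._parsers: dict[str, type[Parser]] = {}

    def register_parser(self, parser: type[Parser]) -> None:
        for ext in parser.extensions():
            self._parsers[ext.lower()] = parser

    def parse(self, path: str | Path) -> Document:
        path = Path(path)
        parser = self._parsers.get(path.suffix.lower())
        if parser is None:
            raise ValueError(f"unsupported extension: {path.suffix!r}")
        return parser.parse(path)
```

Because `Parser` is a `typing.Protocol`, a static type checker accepts `TxtParser` wherever a `Parser` is expected without any base class, which is the "no inheritance" property the CLAUDE.md text refers to.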

ureca_document_parser-0.0.1/PKG-INFO ADDED

@@ -0,0 +1,45 @@
+Metadata-Version: 2.4
+Name: ureca_document_parser
+Version: 0.0.1
+Summary: Multi-format document parser and converter (HWP, HWPX, PDF, Image)
+Project-URL: Homepage, https://ureca-corp.github.io/document_parser/
+Project-URL: Documentation, https://ureca-corp.github.io/document_parser/
+Project-URL: Repository, https://github.com/ureca-corp/document_parser
+Project-URL: Issues, https://github.com/ureca-corp/document_parser/issues
+Author-email: Ureca Enterprise Corp <andy@ureca.im>
+License: MIT
+Keywords: converter,document,hwp,hwpx,markdown,parser
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Text Processing :: Markup
+Requires-Python: >=3.12
+Requires-Dist: olefile>=0.47
+Provides-Extra: all
+Requires-Dist: langchain-core>=0.2; extra == 'all'
+Requires-Dist: langchain-text-splitters>=0.2; extra == 'all'
+Requires-Dist: mkdocs-material>=9.5; extra == 'all'
+Requires-Dist: mkdocs>=1.6; extra == 'all'
+Requires-Dist: mkdocstrings[python]>=0.27; extra == 'all'
+Requires-Dist: pillow>=10.0; extra == 'all'
+Requires-Dist: pymupdf>=1.24; extra == 'all'
+Requires-Dist: pytesseract>=0.3; extra == 'all'
+Provides-Extra: dev
+Requires-Dist: mypy>=1.10; extra == 'dev'
+Requires-Dist: pytest-cov>=5.0; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.4; extra == 'dev'
+Provides-Extra: docs
+Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
+Requires-Dist: mkdocs>=1.6; extra == 'docs'
+Requires-Dist: mkdocstrings[python]>=0.27; extra == 'docs'
+Provides-Extra: langchain
+Requires-Dist: langchain-core>=0.2; extra == 'langchain'
+Requires-Dist: langchain-text-splitters>=0.2; extra == 'langchain'
+Provides-Extra: ocr
+Requires-Dist: pillow>=10.0; extra == 'ocr'
+Requires-Dist: pytesseract>=0.3; extra == 'ocr'
+Provides-Extra: pdf
+Requires-Dist: pymupdf>=1.24; extra == 'pdf'
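
Note on the PKG-INFO metadata: the file uses RFC 822-style headers, so the `Requires-Dist` entries can be grouped by extra with the standard library alone. The sketch below is a rough illustration; `requirements_by_extra` is a hypothetical helper, the embedded `PKG_INFO` string is an abbreviated copy of the headers above, and the marker handling only covers the simple `extra == '...'` form that appears in this file.

```python
from collections import defaultdict
from email.parser import HeaderParser

# Abbreviated copy of the PKG-INFO headers shown in the diff.
PKG_INFO = """\
Metadata-Version: 2.4
Name: ureca_document_parser
Version: 0.0.1
Requires-Dist: olefile>=0.47
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2; extra == 'langchain'
Requires-Dist: langchain-text-splitters>=0.2; extra == 'langchain'
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24; extra == 'pdf'
"""


def requirements_by_extra(pkg_info: str) -> dict[str, list[str]]:
    """Group Requires-Dist entries by extra; "" holds unconditional ones."""
    headers = HeaderParser().parsestr(pkg_info)
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in headers.get_all("Requires-Dist", []):
        requirement, _, marker = entry.partition(";")
        extra = ""
        # Naive marker handling: assumes a single `extra == 'name'` clause.
        if "extra ==" in marker:
            extra = marker.split("==", 1)[1].strip().strip("'\"")
        grouped[extra].append(requirement.strip())
    return dict(grouped)


if __name__ == "__main__":
    for extra, reqs in requirements_by_extra(PKG_INFO).items():
        print(extra or "(required)", reqs)
```

For compound environment markers, a real marker parser such as the one in the `packaging` library would be a safer choice than the string splitting used here.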

ureca_document_parser-0.0.1/README.md ADDED

File without changes

ureca_document_parser-0.0.1/docs/api-reference.md ADDED

@@ -0,0 +1,247 @@
+# API Reference
+Public API documentation for `ureca_document_parser`.
+## Top-Level Functions
+The main functions you import directly from the package.
+### convert()
+Converts a file and saves the result.
+::: ureca_document_parser.convert
+    options:
+      heading_level: 3
+      show_source: false
+**Example:**
+```python
+from ureca_document_parser import convert
+convert("보고서.hwp", "보고서.md")
+```
+---
+### convert_to_chunks()
+Parses a file and splits it into LangChain chunks.
+!!! note "Optional dependency required"
+    This function requires the `langchain` extra.
+    ```bash
+    uv add "ureca_document_parser[langchain]"
+    ```
+::: ureca_document_parser.convert_to_chunks
+    options:
+      heading_level: 3
+      show_source: false
+**Example:**
+```python
+from ureca_document_parser import convert_to_chunks
+chunks = convert_to_chunks("보고서.hwp", chunk_size=1000, chunk_overlap=200)
+for chunk in chunks:
+    print(chunk.page_content)
+    print(chunk.metadata)
+```
+---
+### get_registry()
+Returns the format registry singleton.
+::: ureca_document_parser.get_registry
+    options:
+      heading_level: 3
+      show_source: false
+**Example:**
+```python
+from ureca_document_parser import get_registry
+registry = get_registry()
+doc = registry.parse("보고서.hwp")
+markdown = registry.write(doc, "markdown")
+```
+---
+## Document Model
+The data models that represent a parse result.
+### Document
+Represents an entire document.
+::: ureca_document_parser.Document
+    options:
+      heading_level: 3
+      show_source: false
+      members:
+        - elements
+        - metadata
+---
+### Metadata
+Holds the document metadata.
+::: ureca_document_parser.Metadata
+    options:
+      heading_level: 3
+      show_source: false
+---
+### Paragraph
+Represents a paragraph element.
+::: ureca_document_parser.Paragraph
+    options:
+      heading_level: 3
+      show_source: false
+---
+### Table
+Represents a table element.
+::: ureca_document_parser.Table
+    options:
+      heading_level: 3
+      show_source: false
+---
+### TableRow
+Represents a table row.
+::: ureca_document_parser.TableRow
+    options:
+      heading_level: 3
+      show_source: false
+---
+### TableCell
+Represents a table cell.
+::: ureca_document_parser.TableCell
+    options:
+      heading_level: 3
+      show_source: false
+---
+### ListItem
+Represents a list item.
+::: ureca_document_parser.ListItem
+    options:
+      heading_level: 3
+      show_source: false
+---
+### Image
+Represents an image element.
+::: ureca_document_parser.Image
+    options:
+      heading_level: 3
+      show_source: false
+---
+### Link
+Represents a link element.
+::: ureca_document_parser.Link
+    options:
+      heading_level: 3
+      show_source: false
+---
+### HorizontalRule
+Represents a horizontal rule element.
+::: ureca_document_parser.HorizontalRule
+    options:
+      heading_level: 3
+      show_source: false
+---
+## Exceptions
+### ParseError
+The exception raised when parsing fails.
+::: ureca_document_parser.ParseError
+    options:
+      heading_level: 3
+      show_source: false
+**Example:**
+```python
+from ureca_document_parser import get_registry, ParseError
+registry = get_registry()
+try:
+    doc = registry.parse("손상된파일.hwp")
+except ParseError as e:
+    print(f"파싱 실패: {e}")
+```
+---
+## Protocol
+The interfaces to implement when adding a new parser or Writer. See the [format extension guide](reference/extending.md) for details.
+### Parser
+::: ureca_document_parser.Parser
+    options:
+      heading_level: 3
+      show_source: false
+---
+### Writer
+::: ureca_document_parser.Writer
+    options:
+      heading_level: 3
+      show_source: false
+---
+## Learn More
+- [Advanced usage](guides/advanced.md): working with the Document model directly
+- [Extending formats](reference/extending.md): adding a new parser/Writer
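
Note on `chunk_size` and `chunk_overlap` in the `convert_to_chunks()` example: the diff does not show which splitter the package uses or how it applies these parameters, so the sketch below only illustrates the general idea of fixed-size chunks that share an overlapping tail. `split_with_overlap` is a hypothetical helper working on plain characters; LangChain's text splitters additionally try to break on separators, so their output will differ.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. This is a simplified
    character-window illustration, not the package's or LangChain's algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
```

With `chunk_size=1000` and `chunk_overlap=200` as in the example above, each window in this sketch would advance by 800 characters.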