PyPI - entityxtract - Versions diffs - 0.5.2__tar.gz - Mend

entityxtract 0.5.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

entityxtract-0.5.2/.env.sample +3 -0
entityxtract-0.5.2/.github/workflows/publish.yml +41 -0
entityxtract-0.5.2/.gitignore +20 -0
entityxtract-0.5.2/.python-version +1 -0
entityxtract-0.5.2/LICENSE +21 -0
entityxtract-0.5.2/PKG-INFO +320 -0
entityxtract-0.5.2/README.md +289 -0
entityxtract-0.5.2/docs/assets/entityxtract_flow.png +0 -0
entityxtract-0.5.2/docs/assets/logo.png +0 -0
entityxtract-0.5.2/memory-bank/activeContext.md +107 -0
entityxtract-0.5.2/memory-bank/productContext.md +160 -0
entityxtract-0.5.2/memory-bank/progress.md +144 -0
entityxtract-0.5.2/memory-bank/projectbrief.md +182 -0
entityxtract-0.5.2/memory-bank/systemPatterns.md +206 -0
entityxtract-0.5.2/memory-bank/techContext.md +86 -0
entityxtract-0.5.2/pyproject.toml +44 -0
entityxtract-0.5.2/src/entityxtract/__init__.py +0 -0
entityxtract-0.5.2/src/entityxtract/config.py +46 -0
entityxtract-0.5.2/src/entityxtract/extractor.py +433 -0
entityxtract-0.5.2/src/entityxtract/extractor_types.py +254 -0
entityxtract-0.5.2/src/entityxtract/logging_config.py +118 -0
entityxtract-0.5.2/src/entityxtract/pdf/__init__.py +0 -0
entityxtract-0.5.2/src/entityxtract/pdf/converter.py +95 -0
entityxtract-0.5.2/src/entityxtract/pdf/extractor.py +141 -0
entityxtract-0.5.2/src/entityxtract/prompts/__init__.py +33 -0
entityxtract-0.5.2/src/entityxtract/prompts/string.txt +36 -0
entityxtract-0.5.2/src/entityxtract/prompts/system.txt +2 -0
entityxtract-0.5.2/src/entityxtract/prompts/table.txt +38 -0
entityxtract-0.5.2/tests/__init__.py +0 -0
entityxtract-0.5.2/tests/data/attention-is-all-you-need.pdf +0 -0
entityxtract-0.5.2/tests/test.py +139 -0
entityxtract-0.5.2/tests/utils_io.py +25 -0
entityxtract-0.5.2/uv.lock +1362 -0

entityxtract-0.5.2/.env.sample ADDED

@@ -0,0 +1,3 @@
+OPENAI_API_KEY: "your-api-key"
+OPENAI_API_BASE: "https://openrouter.ai/api/v1"
+OPENAI_DEFAULT_MODEL: "google/gemini-2.5-flash"
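
Note that the sample file above writes each setting as `KEY: "value"`, whereas the README later in this diff shows `export KEY="value"`. How the package itself reads these values is not visible here (the body of `src/entityxtract/config.py` is not shown), so the following is only a minimal standard-library sketch, not the package's loader. The names `parse_env_text` and `apply_env` are hypothetical; the sketch accepts either spelling and does not handle inline comments or multi-line values.

```python
import os
import re

# Accepts `KEY: "value"`, `KEY=value`, and `export KEY="value"`.
_LINE = re.compile(r'^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*[:=]\s*(.*?)\s*$')


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple key/value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything that is not a plain assignment
        key, value = match.groups()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values


def apply_env(values: dict[str, str], environ=os.environ) -> None:
    """Copy parsed values into the environment without overriding existing ones."""
    for key, value in values.items():
        environ.setdefault(key, value)


if __name__ == "__main__":
    sample = (
        'OPENAI_API_KEY: "your-api-key"\n'
        'OPENAI_API_BASE: "https://openrouter.ai/api/v1"\n'
    )
    print(parse_env_text(sample))
```

Using `setdefault` means a variable already exported in the shell takes precedence over the file, which is a common convention for `.env` handling but is a design choice of this sketch.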

entityxtract-0.5.2/.github/workflows/publish.yml ADDED

@@ -0,0 +1,41 @@
+name: Publish to PyPI
+on:
+  release:
+    types: [published]
+jobs:
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+      - name: Build package
+        run: uv build
+      - name: Upload build artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+  publish:
+    name: Publish to PyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment: pypi
+    permissions:
+      id-token: write  # Required for trusted publishing via OIDC
+    steps:
+      - name: Download build artifacts
+        uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

entityxtract-0.5.2/.gitignore ADDED

@@ -0,0 +1,20 @@
+# Python-generated files
+__pycache__/
+*.py[oc]
+build/
+dist/
+wheels/
+*.egg-info
+# Virtual environments
+.venv
+.vscode
+config.yml
+.env
+# Logs
+logs/
+# Misc
+.DS_Store

entityxtract-0.5.2/.python-version ADDED

@@ -0,0 +1 @@
+3.12

entityxtract-0.5.2/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Prathamesh Ghatole
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

entityxtract-0.5.2/PKG-INFO ADDED

@@ -0,0 +1,320 @@
+Metadata-Version: 2.4
+Name: entityxtract
+Version: 0.5.2
+Summary: A provider-agnostic, entity-centric LLM-powered document entity extraction tool
+Project-URL: Homepage, https://github.com/Prathamesh-Ghatole/entityxtract
+Project-URL: Repository, https://github.com/Prathamesh-Ghatole/entityxtract
+Project-URL: Issues, https://github.com/Prathamesh-Ghatole/entityxtract/issues
+Author-email: Prathamesh-Ghatole <prathamesh.s.ghatole@gmail.com>
+License: MIT
+License-File: LICENSE
+Keywords: ai,document,entity,extraction,llm,nlp,pdf,structured-data
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.12
+Requires-Dist: fastapi[standard]>=0.116.1
+Requires-Dist: langchain-openai>=0.3.32
+Requires-Dist: langchain>=0.3.27
+Requires-Dist: pillow>=11.3.0
+Requires-Dist: polars>=1.33.0
+Requires-Dist: pydantic>=2.11.7
+Requires-Dist: pypdfium2>=4.30.0
+Requires-Dist: python-dotenv>=1.1.1
+Requires-Dist: requests>=2.32.5
+Requires-Dist: xlsxwriter>=3.2.5
+Description-Content-Type: text/markdown
+<!-- <p align="center">
+  <a href="https://github.com/Prathamesh-Ghatole/entityxtract">
+    <img loading="lazy" alt="entityxtract" src="https://github.com/Prathamesh-Ghatole/entityxtract/raw/main/docs/assets/logo.png" width="50%"/>
+  </a>
+</p> -->
+# entityxtract
+[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
+[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
+[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
+[![License MIT](https://img.shields.io/github/license/Prathamesh-Ghatole/entityxtract)](https://opensource.org/licenses/MIT)
+**Entity-first, schema-driven extraction of structured data from unstructured documents** (PDF, DOCX, TXT, images). Define custom entities with schemas, few-shot examples, and instructions, then extract reliably using any local or SOTA LLM.
+Built as an **open-source alternative** to Google Cloud Document AI, Azure AI Document Intelligence, and Adobe PDF Extract — but provider-agnostic and designed to work with any LLM.
+<p align="center">
+  <a href="https://github.com/Prathamesh-Ghatole/entityxtract">
+    <img loading="lazy" alt="entityxtract" src="https://github.com/Prathamesh-Ghatole/entityxtract/raw/main/docs/assets/entityxtract_flow.png" width="100%"/>
+  </a>
+</p>
+## Features
+* 🎯 **Entity-first extraction** — Smart structured data extraction with pre-defined / auto-identified entities.
+* 📄 **Multiple document formats** — Support for PDF, TXT, MD, and images.
+* 🔀 **Smart input modes** — Extract information using text, OCR, or hybrid approaches.
+* 🌐 **Provider-agnostic design** — Works with any LLM via OpenAI-compatible APIs.
+* 🔄 **Robust execution** — Built-in retries, parallel extraction, strictly structured and typed output.
+* 📊 **Observability** — Structured logs, token usage tracking, and optional cost tracking.
+* 📦 **PyPI Package** — Easily install and use entityxtract in your projects.
+### Coming Soon
+* 🌐 **FastAPI REST API** for remote extraction services.
+* 🖥️ **Web UI** for visual entity/schema management and job monitoring.
+* 🔍 **Auto-detect mode** to automatically identify extractable entities in documents.
+* 💰 **Cost Optimization** using PDF annotation caching, and smart input data pruning.
+* 👁️ **Deepseek OCR** integration for enhanced document processing.
+* 🔌 **MCP server** for agentic applications.
+## Installation
+To use entityxtract, you'll need Python 3.12+ and [uv](https://docs.astral.sh/uv/) (recommended):
+```bash
+# Install uv if you haven't already
+curl -LsSf https://astral.sh/uv/install.sh | sh
+# Clone the repository
+git clone https://github.com/Prathamesh-Ghatole/entityxtract.git
+cd entityxtract
+# Install dependencies
+uv sync
+```
+## Getting Started
+Extract pre-defined entities:
+```python
+from pathlib import Path
+import polars as pl
+from entityxtract.extractor_types import (
+    Document, TableToExtract, ObjectsToExtract,
+    ExtractionConfig, FileInputMode
+)
+from entityxtract.extractor import extract_objects
+# 1. Load your document
+doc = Document(Path("document.pdf"))
+# 2. Define what to extract
+table = TableToExtract(
+    name="Events",
+    example_table=pl.DataFrame([
+        {"Time": "02:05", "Type": "Operation", "Description": "Example event"},
+        {"Time": "03:25", "Type": "Transit", "Description": "Another event"}
+    ]),
+    instructions="Extract the events table with Time, Type, and Description columns.",
+    required=True
+)
+# 3. Configure extraction
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",  # Recommended
+    temperature=0.0,
+    file_input_modes=[FileInputMode.FILE]
+)
+# 4. Extract!
+results = extract_objects(doc, ObjectsToExtract(objects=[table], config=config))
+# Use your results
+for name, result in results.results.items():
+    if result.success:
+        df = pl.DataFrame(result.extracted_data)
+        print(df)
+    else:
+        print(f"Failed: {result.message}")
+```
+## Configuration
+Copy the sample environment file `.env.sample` to `.env`, or set the following environment variables directly:
+```bash
+# For all OpenAI-compatible endpoints [OpenAI, OpenRouter, Ollama, lm-studio, etc.]
+export OPENAI_API_KEY="your-api-key"
+export OPENAI_API_BASE="https://openrouter.ai/api/v1"
+# Default model
+export OPENAI_DEFAULT_MODEL="google/gemini-2.5-flash"
+```
+## Usage Examples
+### Complete Example with Multiple Entities
+```python
+from pathlib import Path
+import polars as pl
+from entityxtract.extractor_types import (
+    Document, ExtractionConfig, FileInputMode,
+    TableToExtract, StringToExtract, ObjectsToExtract
+)
+from entityxtract.extractor import extract_objects
+# Load document
+doc = Document(Path("reports/quarterly_summary.pdf"))
+# Define entities to extract
+table = TableToExtract(
+    name="Financial Summary",
+    example_table=pl.DataFrame([
+        {"Quarter": "Q1 2024", "Revenue": "$1.2M", "Expenses": "$800K", "Profit": "$400K"},
+        {"Quarter": "Q2 2024", "Revenue": "$1.5M", "Expenses": "$900K", "Profit": "$600K"}
+    ]),
+    instructions="Extract the quarterly financial summary table with Quarter, Revenue, Expenses, and Profit columns.",
+    required=True
+)
+report_id = StringToExtract(
+    name="Report ID",
+    example_string="RPT-2024-Q2-001",
+    instructions="Extract the report identifier from the document header.",
+    required=False
+)
+# Configure extraction with cost tracking
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",
+    temperature=0.0,
+    file_input_modes=[FileInputMode.FILE],
+    parallel_requests=4,
+    calculate_costs=True
+)
+# Run extraction
+objects = ObjectsToExtract(objects=[table, report_id], config=config)
+results = extract_objects(doc, objects)
+# Process results
+for name, res in results.results.items():
+    if res.success:
+        print(f"✓ [{name}] extracted successfully")
+        print(f"  Tokens: {res.input_tokens} in / {res.output_tokens} out")
+        print(f"  Cost: ${res.cost:.4f}")
+        # Export table to CSV
+        if isinstance(res.extracted_data, list):
+            df = pl.DataFrame(res.extracted_data)
+            df.write_csv(f"{name}.csv")
+            print(f"  Saved to {name}.csv")
+    else:
+        print(f"✗ [{name}] failed: {res.message}")
+print(f"\nTotals: {results.total_input_tokens} tokens in, {results.total_output_tokens} tokens out")
+print(f"Total cost: ${results.total_cost:.4f}")
+```
+### Different Input Modes
+```python
+# Pass document as file attachment
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",
+    file_input_modes=[FileInputMode.FILE]
+)
+# Pass document as text content
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",
+    file_input_modes=[FileInputMode.TEXT]
+)
+# Pass document as images (useful for scanned documents)
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",
+    file_input_modes=[FileInputMode.IMAGE]
+)
+# Combine multiple input modes
+config = ExtractionConfig(
+    model_name="google/gemini-2.5-flash",
+    file_input_modes=[FileInputMode.FILE, FileInputMode.TEXT]
+)
+```
+See `tests/test.py` for more complete examples.
+## Roadmap
+### Interfaces
+- 🌐 FastAPI REST API for remote extraction services
+- 🖥️ Web UI for entity management, job runs, and results review
+- 🤖 Auto-detect mode: automatically identify entities in documents
+### Developer Experience
+- 📦 Publish to PyPI for easy `pip install entityxtract`
+- ⚡ ENV-first configuration (deprecate YAML)
+- 💾 Document annotation caching to reduce token usage
+- 🔧 JSON import/export for entity schemas and results
+- 📝 Enhanced CLI with `entityxtract` command
+### Providers & Models
+- 🏠 Local inference via Ollama
+- 🔌 Native adapters for OpenAI, Gemini, Claude, and more
+- 🌍 Support for additional LLM providers
+### Quality & Testing
+- ✅ Expanded test coverage
+- 📊 Benchmark suite for accuracy and performance
+- 📚 Comprehensive documentation site
+## Comparisons
+entityxtract positions itself as a flexible, open-source alternative to both commercial services and closed-source solutions:
+**Key Differentiators:**
+- **Provider Agnostic**: Works with any LLM, not locked to a single provider
+- **Open Source**: Full transparency, customizable, and community-driven
+- **Schema + Examples**: Strong emphasis on structured entity definitions with few-shot learning
+- **Complete Stack**: Python SDK today, REST API and Web UI coming soon
+## Contributing
+We welcome contributions! entityxtract uses modern Python tooling:
+```bash
+# Use uv for environment management
+uv sync
+# Run tests
+uv run pytest tests/
+# Code formatting with Ruff
+uv run ruff check .
+uv run ruff format .
+```
+**Guidelines:**
+- Follow strict JSON output conventions
+- Include tests for new features
+- Update documentation as needed
+- Use structured logging patterns
+Open an issue or PR with a clear description and we'll be happy to review!
+## Get Help and Support
+- 💬 [GitHub Discussions](https://github.com/Prathamesh-Ghatole/entityxtract/discussions) - Ask questions and share ideas
+- 🐛 [Issues](https://github.com/Prathamesh-Ghatole/entityxtract/issues) - Report bugs or request features
+- 📧 Contact: prathamesh.s.ghatole@gmail.com
+## License
+entityxtract is released under the [MIT License](LICENSE). Free for commercial and personal use.
+---
+**Built with ❤️ by [Prathamesh Ghatole](https://github.com/Prathamesh-Ghatole)**
+*entityxtract was built out of the need for intelligent entity extraction from documents using AI with minimal effort. Define what you need, and let AI handle the rest.*
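
The README's complete example writes each table with `df.write_csv(f"{name}.csv")`, so an entity named "Financial Summary" produces a file name containing a space. The sketch below shows one way to derive a safer file name and to serialize rows with the standard library alone. It assumes `extracted_data` for a table is a list of dictionaries; the README passes it straight to `pl.DataFrame`, but the exact structure is not shown in this diff. `safe_filename` and `rows_to_csv` are hypothetical helpers, not part of entityxtract.

```python
import csv
import io
import re


def safe_filename(name: str) -> str:
    """Turn an entity name such as 'Financial Summary' into 'financial_summary'."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").lower()
    return slug or "entity"


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize a list of row dictionaries to CSV text."""
    # Collect column names in first-seen order so rows with missing keys still fit.
    fieldnames: list[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        {"Quarter": "Q1 2024", "Revenue": "$1.2M"},
        {"Quarter": "Q2 2024", "Profit": "$600K"},
    ]
    print(safe_filename("Financial Summary") + ".csv")
    print(rows_to_csv(rows))
```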