PyPI - graphrag-core - Versions diffs - 0.2.0__tar.gz - Mend

graphrag-core 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (86)

graphrag_core-0.2.0/.github/workflows/release.yml ADDED

@@ -0,0 +1,28 @@
+name: Release
+on:
+  push:
+    tags:
+      - "v*"
+permissions:
+  id-token: write
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    environment: pypi
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Build package
+        run: uv build
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

graphrag_core-0.2.0/.github/workflows/test.yml ADDED

@@ -0,0 +1,35 @@
+name: Tests
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v4
+      - name: Set up Python
+        run: uv python install 3.12
+      - name: Install dependencies
+        run: uv sync --all-extras
+      - name: Run unit tests
+        run: uv run pytest tests/ -x -q
+      - name: Boundary check — no domain leakage
+        run: |
+          FORBIDDEN="MonitoringTopic|Perspective|SubjectArea|Interview|Dalux|CapturePoint|InvestorAlert|SollIst|EY|Parthenon|Prague"
+          if grep -rn -E "$FORBIDDEN" src/; then
+            echo "DOMAIN LEAKAGE DETECTED in graphrag-core"
+            exit 1
+          else
+            echo "graphrag-core is clean"
+          fi
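
The boundary check above can be approximated locally in Python. This is a minimal sketch, not part of the package: `find_leaks` and `scan_tree` are hypothetical names, and the matching rule differs from the workflow on purpose. The workflow's `grep -E` matches substrings, so a short term such as `EY` also matches identifiers like `API_KEY`. The sketch assumes that terms of three characters or fewer should match only as whole words, while longer terms keep substring matching so that `SollIst` still catches `SollIstAbgleich`.

```python
import re
from pathlib import Path

# Terms copied from the FORBIDDEN variable in the workflow above.
FORBIDDEN_TERMS = [
    "MonitoringTopic", "Perspective", "SubjectArea", "Interview", "Dalux",
    "CapturePoint", "InvestorAlert", "SollIst", "EY", "Parthenon", "Prague",
]


def _term_pattern(term: str) -> str:
    """Assumption: short terms match as whole words, longer ones as substrings."""
    escaped = re.escape(term)
    return rf"\b{escaped}\b" if len(term) <= 3 else escaped


FORBIDDEN_RE = re.compile("|".join(_term_pattern(t) for t in FORBIDDEN_TERMS))


def find_leaks(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_term) pairs for forbidden terms in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FORBIDDEN_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every file under root; keys are paths with at least one hit."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            hits = find_leaks(path.read_text(encoding="utf-8", errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report
```

Calling `scan_tree("src")` and exiting non-zero when the result is non-empty would mirror the CI step. Because `_` counts as a word character, the whole-word rule would miss a name such as `EY_client`; whether that trade-off is acceptable depends on the codebase.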

graphrag_core-0.2.0/.gitignore ADDED

@@ -0,0 +1,191 @@
+# ============================================================================
+# Cross-Platform Dotfiles .gitignore
+# ============================================================================
+# -----------------------------------------------------------------------------
+# macOS
+# -----------------------------------------------------------------------------
+.DS_Store
+.AppleDouble
+.LSOverride
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+# Icon must end with two \r
+Icon
+# Thumbnails
+._*
+# -----------------------------------------------------------------------------
+# Linux
+# -----------------------------------------------------------------------------
+*~
+.directory
+.Trash-*
+.nfs*
+# -----------------------------------------------------------------------------
+# Windows
+# -----------------------------------------------------------------------------
+Thumbs.db
+Thumbs.db:encryptable
+ehthumbs.db
+ehthumbs_vista.db
+[Dd]esktop.ini
+$RECYCLE.BIN/
+*.lnk
+# -----------------------------------------------------------------------------
+# Shell & Terminal
+# -----------------------------------------------------------------------------
+# History files
+.bash_history
+.zsh_history
+.python_history
+.node_repl_history
+.lesshst
+# Zsh compiled files
+*.zwc
+*.zwc.old
+.zcompdump*
+# Shell local overrides
+*.local
+# -----------------------------------------------------------------------------
+# Editors
+# -----------------------------------------------------------------------------
+# Vim
+*.swp
+*.swo
+*.swn
+.*.sw?
+*~
+.netrwhist
+# VS Code (if you want to ignore workspace settings)
+# .vscode/
+# Emacs
+*~
+\#*\#
+.\#*
+.emacs.desktop
+.emacs.desktop.lock
+# -----------------------------------------------------------------------------
+# Security & Credentials
+# -----------------------------------------------------------------------------
+# Environment files
+.env
+.env.local
+.env.*.local
+# SSH keys (safety net - should never be in dotfiles repo anyway)
+id_rsa
+id_dsa
+id_ecdsa
+id_ed25519
+*.pem
+*.key
+# AWS credentials
+.aws/credentials
+# Other credentials
+.netrc
+.gnupg/
+# -----------------------------------------------------------------------------
+# Backups & Temporary Files
+# -----------------------------------------------------------------------------
+*.backup
+*.bak
+*.tmp
+*.temp
+.backup/
+backup/
+*_backup/
+dotfiles_backup/
+# -----------------------------------------------------------------------------
+# Package Managers & Dependencies
+# -----------------------------------------------------------------------------
+node_modules/
+.npm/
+.yarn/
+# -----------------------------------------------------------------------------
+# Cache & Generated Files
+# -----------------------------------------------------------------------------
+.cache/
+*.log
+*.pid
+# Oh My Zsh custom (if you add custom plugins locally)
+# .oh-my-zsh/custom/
+# -----------------------------------------------------------------------------
+# Claude Code - Personal Data & State Files
+# -----------------------------------------------------------------------------
+# Exclude personal data, session history, and cache
+.claude.json
+.claude.json.backup
+.claude/history.jsonl
+.claude/file-history/
+.claude/todos/
+.claude/session-env/
+.claude/shell-snapshots/
+.claude/debug/
+.claude/statsig/
+.claude/.anthropic/
+.claude/settings.local.json
+# Keep these Claude Code files (should be committed):
+# .claude/commands/ - Custom slash commands
+# .claude/settings.json - Project settings
+# .mcp.json - MCP server configuration
+# CLAUDE.md - Project context
+# -----------------------------------------------------------------------------
+# Project Specific
+# -----------------------------------------------------------------------------
+# Test files you might create while testing configs
+test/
+scratch/
+# -----------------------------------------------------------------------------
+# Python
+# -----------------------------------------------------------------------------
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.venv/
+venv/
+.eggs/
+*.egg-info/
+*.egg
+dist/
+build/
+.mypy_cache/
+.pytest_cache/
+.ruff_cache/
+htmlcov/
+.coverage
+.coverage.*
+# graphify
+graphify-out/

graphrag_core-0.2.0/CHANGELOG.md ADDED

@@ -0,0 +1,36 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/).
+## [0.2.0] - 2026-04-12
+### Added
+- **BB1: Document Ingestion** — PDF, DOCX, Text, Markdown parsers; TokenChunker; IngestionPipeline
+- **BB2: Schema-Guided Extraction** — LLMClient Protocol, AnthropicLLMClient, LLMExtractionEngine with strict schema validation
+- **BB3: Provenance-Native Graph** — InMemoryGraphStore, Neo4jGraphStore with full provenance tracking
+- **BB4: Hybrid Search** — InMemorySearchEngine, Neo4jHybridSearch with Reciprocal Rank Fusion
+- **BB5: Governed Curation** — DeterministicDetectionLayer (duplicates, orphans, schema violations), CurationPipeline
+- **BB6: Entity Registry** — InMemoryEntityRegistry with exact/fuzzy matching (token normalization + SequenceMatcher)
+- **BB7: Tool Library** — ToolLibrary with 4 core tools (get_entity, search_entities, get_audit_trail, get_related)
+- **BB8: Multi-Agent Orchestration** — Agent/Orchestrator/ReportRenderer Protocols, SequentialOrchestrator, AgentContext
+- Cypher injection protection via identifier validation
+- Integration test framework with `--run-integration` flag
+- Optional dependencies: `graphrag-core[neo4j]`, `graphrag-core[anthropic]`, `graphrag-core[all]`
+### Protocols (defined, no default implementation yet)
+- `LLMCurationLayer`, `ApprovalGateway` (BB5 layers 2-3)
+- `ReportRenderer` (BB8)
+- `EmbeddingModel` (cross-cutting)
+## [0.1.0] - 2026-04-10
+### Added
+- Initial commit establishing prior art
+- BB1-BB4 Protocol interfaces (`DocumentParser`, `Chunker`, `ExtractionEngine`, `GraphStore`, `SearchEngine`)
+- Pydantic data models for BB1-BB4
+- Project scaffolding with hatchling build system
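
The BB4 entry in the changelog above names Reciprocal Rank Fusion as the method for combining search results. The sketch below shows the general technique only; it is not taken from `Neo4jHybridSearch`, and the function name, the input shape (lists of IDs, best first), and the constant `k = 60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list of (id, score) pairs.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a full-text index and a vector index.
fulltext_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

IDs that appear in both lists accumulate two contributions, so here `b` and `c` rank ahead of `a`, even though `a` was the top full-text hit.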

graphrag_core-0.2.0/CLAUDE.md ADDED

@@ -0,0 +1,97 @@
+# graphrag-core
+> Domain-agnostic Graph RAG framework. MIT License. Open Source.
+## What This Is
+Layer 1 of a 3-layer architecture. This repo contains ONLY domain-agnostic platform code.
+Domain-specific logic (construction monitoring, due diligence, compliance) lives in separate repos that import graphrag-core as a dependency.
+## The One Rule That Cannot Be Broken
+**No domain logic in this repo.** If you're importing a construction-specific concept, a customer-specific schema, or any business-domain term — stop and refactor. This code must work equally for construction monitoring, transaction due diligence, forensic investigations, or any other document-heavy knowledge work.
+Test: Could a team building a legal compliance graph use this code without modification? If no → it doesn't belong here.
+## Architecture
+8 building blocks, each with an abstract interface (Protocol) and default implementation:
+| # | Block | Interface | Default Impl |
+|---|---|---|---|
+| 1 | Document Ingestion | `DocumentParser`, `Chunker`, `IngestionPipeline` | PDF/DOCX parsers, semantic chunker |
+| 2 | Entity Extraction | `ExtractionEngine`, `OntologySchema` | LLM-based extraction |
+| 3 | Knowledge Graph | `GraphStore` | `Neo4jGraphStore` |
+| 4 | Hybrid Search | `SearchEngine` | `Neo4jHybridSearch` |
+| 5 | Governed Curation | `DetectionLayer`, `LLMCurationLayer`, `ApprovalGateway` | GDS detection, CLI approval |
+| 6 | Entity Registry | `EntityRegistry` | Neo4j-backed registry |
+| 7 | Core Tool Library | `ToolLibrary`, `Tool` | 8 core tools |
+| 8 | Orchestration | `Orchestrator`, `ReportRenderer` | LangGraph, DOCX renderer |
+## Tech Stack
+- Python 3.12+
+- Pydantic v2 for all data models
+- Neo4j (default graph backend, swappable via GraphStore interface)
+- pytest + pytest-asyncio for tests
+- Type hints everywhere. No exceptions.
+## Code Rules
+- All interfaces are `Protocol` classes in `interfaces.py`
+- All data models are `BaseModel` classes in `models.py`
+- Async by default for all I/O
+- Functions < 30 lines. Extract early.
+- Docstrings: Google style, English only.
+- No hardcoded technology references in interface definitions
+- Default implementations live alongside interfaces but are clearly separated
+## Project Structure
+```
+src/graphrag_core/
+├── interfaces.py       # ALL Protocol definitions
+├── models.py           # ALL Pydantic models
+├── ingestion/          # BB1: Parse, chunk, embed, store
+├── extraction/         # BB2: Schema-guided entity extraction
+├── graph/              # BB3: GraphStore + Neo4j default
+├── search/             # BB4: Hybrid search
+├── curation/           # BB5: 3-layer governance
+├── registry/           # BB6: Known entity dedup
+├── tools/              # BB7: Core tool library (semantic layer)
+├── agents/             # BB8: Orchestration + report rendering
+└── report/             # BB8: Report renderer
+```
+## Extension Pattern
+Domain layers extend graphrag-core by:
+1. Defining an `OntologySchema` (node types, relationships)
+2. Registering domain tools via `ToolLibrary.register()`
+3. Implementing domain-specific `Agent` subclasses
+4. Optionally providing a custom `ReportRenderer`
+```python
+# Example: construction monitoring domain
+from graphrag_core import OntologySchema, ToolLibrary, Agent
+schema = OntologySchema(node_types=[...], relationship_types=[...])
+tool_library.register(my_domain_tool)
+class PerspectiveAgent(Agent):
+    async def execute(self, context): ...
+```
+## Commands
+```bash
+pytest tests/ -x -q                    # tests (fail fast)
+pytest tests/ -x -q --cov             # with coverage
+docker compose up neo4j                # start Neo4j for integration tests
+python -m graphrag_core.graph.schema   # apply schema
+```
+## What Does NOT Belong Here
+- Employer-specific anything (deployment configs, client references, internal tooling)
+- Domain-specific terms (MonitoringTopic, SubjectArea, Perspective, CapturePoint, SollIstAbgleich, InvestorAlert)
+- Hardcoded LLM model names (use config/env vars)
+- Any reference to specific organizations or engagements
+## Release Strategy
+- Semantic versioning (MAJOR.MINOR.PATCH)
+- Public GitHub repo
+- Published to PyPI as `graphrag-core`
+- CHANGELOG.md tracks all changes
+- First public commit establishes prior art before any organizational use
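
CLAUDE.md above states that all interfaces are `Protocol` classes and that domain layers plug in their own agents. The sketch below illustrates how a `typing.Protocol` interface lets a class qualify by shape alone, without inheriting from anything. Every name in it (`SketchContext`, `SketchAgent`, `WordCountAgent`, `run_sequentially`) is hypothetical and stands in for the package's real types, whose signatures are not shown in this diff.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class SketchContext:
    """Hypothetical stand-in for the shared state passed to agents."""

    workflow_state: dict[str, Any] = field(default_factory=dict)


@runtime_checkable
class SketchAgent(Protocol):
    """Hypothetical interface: anything with a name and an async execute()."""

    name: str

    async def execute(self, context: SketchContext) -> bool: ...


class WordCountAgent:
    """Satisfies SketchAgent structurally; note there is no base class."""

    name = "word_count"

    async def execute(self, context: SketchContext) -> bool:
        text = context.workflow_state.get("text", "")
        context.workflow_state["word_count"] = len(text.split())
        return True


async def run_sequentially(
    agents: list[SketchAgent], context: SketchContext
) -> list[str]:
    """Run agents one after another; return the names of those that succeeded."""
    succeeded = []
    for agent in agents:
        if await agent.execute(context):
            succeeded.append(agent.name)
    return succeeded


ctx = SketchContext(workflow_state={"text": "Alice works at Acme Corp."})
ran = asyncio.run(run_sequentially([WordCountAgent()], ctx))
```

The practical benefit is that a domain repository can define agents without importing a base class from the core package; a static type checker verifies the match against the Protocol.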

graphrag_core-0.2.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 graphrag-core contributors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

graphrag_core-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,182 @@
+Metadata-Version: 2.4
+Name: graphrag-core
+Version: 0.2.0
+Summary: Domain-agnostic Graph RAG framework for building governed, auditable Knowledge Graphs
+Project-URL: Homepage, https://github.com/cdel1/graphrag-core
+Project-URL: Repository, https://github.com/cdel1/graphrag-core
+Project-URL: Issues, https://github.com/cdel1/graphrag-core/issues
+Author: Dino Celi
+License-Expression: MIT
+License-File: LICENSE
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Typing :: Typed
+Requires-Python: >=3.12
+Requires-Dist: pydantic>=2.0
+Requires-Dist: pypdf>=4.0
+Requires-Dist: python-docx>=1.0
+Provides-Extra: all
+Requires-Dist: anthropic>=0.40; extra == 'all'
+Requires-Dist: neo4j>=5.0; extra == 'all'
+Provides-Extra: anthropic
+Requires-Dist: anthropic>=0.40; extra == 'anthropic'
+Provides-Extra: neo4j
+Requires-Dist: neo4j>=5.0; extra == 'neo4j'
+Description-Content-Type: text/markdown
+# graphrag-core
+A domain-agnostic framework for building governed, auditable Knowledge Graphs from documents using LLM-powered extraction, provenance-native storage, and multi-agent orchestration.
+## Architecture
+```
+YOUR DOMAIN LAYER (Layer 2)
+  Ontology, domain tools, domain agents, templates
+                    |
+                    | imports
+                    v
+graphrag-core (Layer 1)
+  Ingestion   Extraction   Graph Store   Search
+  Curation    Registry     Tool Library  Orchestration
+```
+## Install
+```bash
+pip install graphrag-core                    # core (in-memory backends)
+pip install graphrag-core[neo4j]             # + Neo4j graph store and search
+pip install graphrag-core[anthropic]         # + Claude LLM client
+pip install graphrag-core[all]               # everything
+```
+## Quick Start
+```python
+import asyncio
+from graphrag_core import (
+    TextParser, TokenChunker, IngestionPipeline,
+    InMemoryGraphStore, InMemorySearchEngine,
+    LLMExtractionEngine, OntologySchema, NodeTypeDefinition,
+    PropertyDefinition, RelationshipTypeDefinition,
+    ToolLibrary, register_core_tools,
+)
+from graphrag_core.models import ChunkConfig, DocumentChunk, GraphNode, ImportRun
+from datetime import datetime
+async def main():
+    # 1. Ingest a document
+    pipeline = IngestionPipeline(parser=TextParser(), chunker=TokenChunker())
+    chunks = await pipeline.ingest(b"Alice works at Acme Corp.", "text/plain")
+    # 2. Define your domain schema
+    schema = OntologySchema(
+        node_types=[
+            NodeTypeDefinition(
+                label="Person",
+                properties=[PropertyDefinition(name="name", type="string", required=True)],
+                required_properties=["name"],
+            ),
+            NodeTypeDefinition(
+                label="Company",
+                properties=[PropertyDefinition(name="name", type="string", required=True)],
+                required_properties=["name"],
+            ),
+        ],
+        relationship_types=[
+            RelationshipTypeDefinition(type="WORKS_AT", source_types=["Person"], target_types=["Company"]),
+        ],
+    )
+    # 3. Extract entities (requires an LLMClient implementation)
+    # engine = LLMExtractionEngine(llm_client=your_client)
+    # result = await engine.extract(chunks, schema, import_run)
+    # 4. Store in graph
+    store = InMemoryGraphStore()
+    await store.merge_node(GraphNode(id="p1", label="Person", properties={"name": "Alice"}), "run-1")
+    await store.merge_node(GraphNode(id="c1", label="Company", properties={"name": "Acme Corp"}), "run-1")
+    # 5. Search
+    search = InMemorySearchEngine(
+        nodes=[await store.get_node("p1"), await store.get_node("c1")],
+    )
+    results = await search.fulltext_search("Acme", top_k=5)
+    print(results)
+    # 6. Wire up tools for agents
+    library = ToolLibrary()
+    register_core_tools(library, store, search)
+    result = await library.execute("get_entity", entity_id="p1")
+    print(result)
+asyncio.run(main())
+```
+## Building Blocks
+| # | Block | Interface | Implementation | Status |
+|---|---|---|---|---|
+| 1 | Document Ingestion | `DocumentParser`, `Chunker` | PDF, DOCX, Text, Markdown parsers; TokenChunker | Done |
+| 2 | Entity Extraction | `ExtractionEngine`, `LLMClient` | LLMExtractionEngine, AnthropicLLMClient | Done |
+| 3 | Knowledge Graph | `GraphStore` | InMemoryGraphStore, Neo4jGraphStore | Done |
+| 4 | Hybrid Search | `SearchEngine` | InMemorySearchEngine, Neo4jHybridSearch (RRF) | Done |
+| 5 | Governed Curation | `DetectionLayer` | DeterministicDetectionLayer, CurationPipeline | Done (detection layer) |
+| 6 | Entity Registry | `EntityRegistry` | InMemoryEntityRegistry (fuzzy matching) | Done |
+| 7 | Tool Library | `ToolLibrary` | 4 core tools (get_entity, search, audit_trail, related) | Done |
+| 8 | Orchestration | `Agent`, `Orchestrator` | SequentialOrchestrator, AgentContext | Done |
+The following Protocols are defined but have no default implementation yet:
+- `LLMCurationLayer`, `ApprovalGateway` (BB5 layers 2-3)
+- `ReportRenderer` (BB8)
+- `EmbeddingModel` (cross-cutting)
+## Extension Pattern
+```python
+from graphrag_core import OntologySchema, ToolLibrary, Tool
+from graphrag_core.models import AgentResult  # assumption: import path is not shown in the original
+# 1. Define your domain ontology
+schema = OntologySchema(node_types=[...], relationship_types=[...])
+# 2. Register domain-specific tools (my_handler is a placeholder for your own callable)
+library = ToolLibrary()
+library.register(Tool(name="my_tool", description="...", parameters={}, handler=my_handler))
+# 3. Implement domain agents
+class MyAgent:
+    name = "analyst"
+    async def execute(self, context):
+        result = await context.tool_library.execute("my_tool")
+        context.workflow_state["analysis"] = result.data
+        return AgentResult(agent_name=self.name, success=True)
+```
+## Development
+```bash
+# Clone and install
+git clone https://github.com/cdel1/graphrag-core.git
+cd graphrag-core
+uv sync --all-extras
+# Run unit tests
+uv run pytest tests/ -x -q
+# Run integration tests (requires Neo4j)
+docker run -d --name neo4j-test -p 7474:7474 -p 7687:7687 \
+  -e NEO4J_AUTH=neo4j/development neo4j:5-community
+uv run pytest tests/ -x --run-integration
+# Build
+uv build
+```
+## License
+MIT