pg-raggraph 0.3.0a2__tar.gz
- pg_raggraph-0.3.0a2/.gitignore +31 -0
- pg_raggraph-0.3.0a2/CHANGELOG.md +210 -0
- pg_raggraph-0.3.0a2/CODE_OF_CONDUCT.md +33 -0
- pg_raggraph-0.3.0a2/CONTRIBUTING.md +68 -0
- pg_raggraph-0.3.0a2/LICENSE +21 -0
- pg_raggraph-0.3.0a2/PKG-INFO +390 -0
- pg_raggraph-0.3.0a2/README.md +324 -0
- pg_raggraph-0.3.0a2/SECURITY.md +46 -0
- pg_raggraph-0.3.0a2/benchmarks/age-bakeoff/docker/age/initdb/README.md +9 -0
- pg_raggraph-0.3.0a2/benchmarks/age-bakeoff/pyproject.toml +47 -0
- pg_raggraph-0.3.0a2/benchmarks/medical-hrt/README.md +62 -0
- pg_raggraph-0.3.0a2/benchmarks/musique/README.md +59 -0
- pg_raggraph-0.3.0a2/benchmarks/python-versioned-docs/README.md +34 -0
- pg_raggraph-0.3.0a2/docs/README.md +80 -0
- pg_raggraph-0.3.0a2/docs/archive/README.md +32 -0
- pg_raggraph-0.3.0a2/pyproject.toml +119 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/__init__.py +1432 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/answer.py +140 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/chunking.py +494 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/cli.py +496 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/config.py +237 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/db.py +346 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/embedding.py +123 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/evolution.py +256 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/extraction.py +328 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/mcp_server.py +204 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/models.py +221 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/reranker.py +117 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/resolution.py +83 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/retrieval.py +691 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/server.py +449 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/__init__.py +0 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/migrations/001_embedded_content.sql +33 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/migrations/002_evolution_tracking.sql +99 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/migrations/README.md +10 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/migrations/__init__.py +0 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/sql/schema.sql +218 -0
- pg_raggraph-0.3.0a2/src/pg_raggraph/static/index.html +170 -0
@@ -0,0 +1,31 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.venv/
+venv/
+.env
+.pytest_cache/
+.ruff_cache/
+.coverage
+htmlcov/
+skill-output/
+.autonomy/
+.claude/
+.worktrees/
+*.db
+
+# Downloaded benchmark corpora — too large for git, regenerated by scripts
+benchmarks/hotpotqa/
+benchmarks/postgres-docs/
+benchmarks/scotus/
+benchmarks/kg-rag-eval/
+benchmarks/*.json
+benchmarks/*.log
+benchmarks/age-bakeoff/corpora/msgraph-work/
+benchmarks/python-versioned-docs/_tmp/
+benchmarks/medical-hrt/_tmp/
@@ -0,0 +1,210 @@
+# Changelog
+
+## 0.3.0a2 — 2026-05-02 (pre-public-push hardening)
+
+Second prod-ready audit pass on top of `0.3.0a1`, ahead of the public-repo
+flip + first real PyPI release. Five PRs closed (PR-301..PR-305) plus a
+fix for an evolution-tracking bug surfaced during test hardening.
+
+### Security
+
+- **PR-301 — Bearer auth uses constant-time compare.** `pgrg serve`'s
+  optional `PGRG_SERVER_API_KEY` Bearer middleware now uses
+  `secrets.compare_digest` instead of `!=`. The previous comparison
+  short-circuited on the first differing byte and leaked the key
+  length + prefix via response timing — bypassable by a network
+  attacker over thousands of probes. Added regression-lock test that
+  spies on `secrets.compare_digest` and asserts the auth path actually
+  invokes it (catches a future revert to `==` directly).
+- **PR-303 — Defense-in-depth security headers.** New middleware
+  attaches `Content-Security-Policy`, `X-Content-Type-Options: nosniff`,
+  `X-Frame-Options: DENY`, and `Referrer-Policy: no-referrer` to every
+  response — including the 401/403 short-circuits from the auth
+  middleware. CSP allows `https://unpkg.com` for the bundled
+  `vis-network` UI; tighten to `'self'` once the JS is bundled locally.
+- **PR-304 — MCP `pgrg_ingest` enforces the file-extension allowlist.**
+  Hoisted the canonical extension list to `pg_raggraph.INGEST_ALLOWED_EXTS`
+  so the FastAPI `/ingest` endpoint, the MCP `pgrg_ingest` tool, and
+  the library's directory walker all share one source. An LLM agent
+  that asks the MCP server to ingest a `.exe`, `.so`, `.tar`, etc.
+  now gets a structured `{"error": "unsupported_extension", ...}`
+  response — no garbage entities polluting the knowledge graph.
+
+### Packaging / DX
+
+- **PR-302 — `[project.urls]` table on PyPI.** `pyproject.toml` now
+  declares Homepage, Repository, Issues, Changelog, and Documentation
+  URLs. The PyPI project page sidebar surfaces these — without them
+  the listing was barebones, a permanent first-impression tax.
+- **PR-305 — Reranker actionable ImportError.** When fastembed's
+  cross-encoder submodule isn't available, `FastEmbedReranker._load()`
+  now raises `ImportError` with a `pip install --upgrade 'fastembed>=0.4'`
+  hint instead of letting a bare `ModuleNotFoundError` propagate. Matches
+  the pattern used for the chunkshop integration in `chunking.py`.
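The PR-305 pattern can be sketched generically as below. `load_optional` and its parameters are hypothetical names chosen for illustration; the actual `FastEmbedReranker._load()` body is not shown in this changelog.

```python
import importlib


def load_optional(module_name: str, install_hint: str):
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper; module_name and install_hint are supplied by the caller.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise as ImportError with an actionable message, keeping the
        # original error as the cause for debugging.
        raise ImportError(
            f"{module_name!r} is not available. Try: {install_hint}"
        ) from exc
```

Chaining with `from exc` preserves the original traceback while the user sees the install hint first.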
+
+### Bug fixes
+
+- **datetime metadata no longer crashes ingest.** `rag.ingest(metadata={...})`
+  with `effective_from` / `effective_to` `datetime` values previously
+  failed with `TypeError: Object of type datetime is not JSON serializable`
+  in the `documents.metadata` JSONB path (and similarly for chunk
+  metadata in the chunkshop pre-chunked path). Added a `_json_default`
+  helper that serializes datetimes as ISO 8601 strings — queryable from
+  JSONB via `metadata->>'effective_from'`. Fixed 5 evolution-tier1
+  integration tests that were silently failing on `main` since the
+  `22d83f7` documents-metadata-persistence change.
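A minimal sketch of the kind of fallback the bug-fix entry describes, using only the standard library. The function name `json_default` and the sample metadata are illustrative; the library's private `_json_default` may handle more types.

```python
import json
from datetime import datetime, timezone


def json_default(value):
    """Fallback for json.dumps: render datetimes as ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    # Keep the standard error for anything else we do not know how to encode.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Sample metadata (made up for this example).
metadata = {
    "effective_from": datetime(2026, 1, 1, tzinfo=timezone.utc),
    "version_label": "3.12",
}
payload = json.dumps(metadata, default=json_default)
```

Without `default=json_default`, the same `json.dumps` call raises the `TypeError` quoted in the entry above.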
+
+### Tests
+
+- 13 new tests in `tests/integration/test_error_paths.py` covering
+  PR-301..PR-304 (Bearer auth contract, security-header presence on
+  success + auth-failure paths, MCP `pgrg_ingest` extension rejection
+  and partial-state guard, public `INGEST_ALLOWED_EXTS` import).
+- New `tests/unit/test_reranker.py` (2 tests) covering PR-305.
+- All 204 tests pass on the full suite.
+
+### CI / hygiene
+
+- Cleared the lint+format failures that had been silently red on `main`
+  for several pushes. `ruff check .` and `ruff format --check .` are
+  now both green. Two new excludes added: `benchmarks/sales-crm-demo/`
+  (cookbook demo with its own SQL-heavy conventions, matches the prior
+  `benchmarks/age-bakeoff/` precedent) and `docs/cookbook/samples/*.py`
+  (documentation/demo scripts). One auto-fixed import sort in
+  `chunking.py`; one trimmed docstring example in `__init__.py`.
+
+## 0.3.0a1 — 2026-04-28 (post-audit hardening)
+
+First public PyPI release. Polish + hardening pass on top of `0.3.0a0`. No public-API changes; all 23 production-readiness items from the prod-ready audit closed (22 fixed, 1 false positive). Library + server ready for external use.
+
+### Real-world Tier 1 benchmarks
+
+- **`benchmarks/python-versioned-docs/`** — 12 Python docs (3.10/3.11/3.12), 1364 chunks, 15 hand-written gold questions. **13/13 perfect `version_filter` purity** (100%). Closes the "Tier 1 only synthetic-fixture-tested" gap.
+- **`benchmarks/medical-hrt/`** — 48 PubMed HRT/CV abstracts, 7 epistemically-retracted (modeling WHI 2002 supersession of the prior consensus), 15 gold questions. **5/5 retraction_aware + 5/5 time_travel = 15/15 perfect.** First real-world demonstration of `retracted_behavior="hide"` + `as_of`.
+- 3-part dev-rel blog series in [`docs/blog/`](docs/blog/) walking through both paths from a fresh `git clone`.
+- [`docs/USE-CASES.md`](docs/USE-CASES.md) — decision matrix for classic GraphRAG vs evolving knowledge.
+
+### Server hardening (PR-103, PR-104, PR-205, PR-208)
+
+- **`/graph` pagination** — default `LIMIT 500`, `?limit=N` (max 5000), `?limit=all` for tiny corpora. No more browser OOM on real-corpus visualization.
+- **`/ingest` hardening** — `PGRG_SERVER_MAX_UPLOAD_MB` cap (default 100 MB → 413), extension allowlist (→ 415), filename sanitization (path-traversal-safe, → 400 on empty), temp-file cleanup wrapped in `try/finally` so leaks are impossible.
+- **Optional Bearer auth** — `PGRG_SERVER_API_KEY` env enables auth middleware. Server logs a startup WARN when the env is unset so the unauthenticated state is loud, not silent. `/health` and `/ready` always probe-friendly.
+- **Origin allowlist** — `PGRG_SERVER_ALLOWED_ORIGINS` (comma-separated). When unset, only loopback Origins accepted on POST/PUT/DELETE/PATCH; non-browser clients (curl, requests) without Origin headers still work.
+- **`/ready` endpoint** — distinct from `/health`. Verifies DB connectivity AND `pgrg_meta.schema_version >= SCHEMA_VERSION`. Returns 503 with a structured payload on `db_unreachable` / `schema_pending_migration`.
+- **`/query` default mode** — was `hybrid` (the slowest mode); now `smart` (matches `/ask`).
+
+### Library hardening (PR-203, PR-206, PR-209, PR-210, PR-211, PR-215, PR-216)
+
+- `pg_raggraph.__version__` now resolved via `importlib.metadata.version("pg-raggraph")` so it always matches the installed metadata (no more "0.3.0" string drift from `0.3.0a0` in pyproject).
+- `PGRGConfig` refuses to start with the default Postgres credentials when `PGRG_ENV=production` (raises `RuntimeError`); logs a one-time WARN otherwise.
+- `tune_scoring_weights()` gains a `max_grid_size` parameter (default 50) — refuses to run grids exceeding the cap before any LLM call. Cost-safety guard.
+- `rag.request_shutdown()` for graceful drain of long-running ingest. SIGTERM/SIGINT handlers can wire it; in-flight per-doc transactions finish, queued files become no-ops counted as `skipped`. Re-running `ingest()` resumes via content-hash dedup.
+- `PGRG_LOG_FORMAT=json` — stdlib-only structured logging on stderr (no extra dep). Activates only when no handlers are pre-attached to the `pg_raggraph` logger.
+- `os.nice()` no longer mutates process priority on `PGRGConfig` import. New `apply_nice_level()` method called from `ingest()` where CPU-yield was actually wanted.
+- `ingest_profile` and `extraction_prompt` typed as `Literal[...]` — typos via env now raise `ValidationError` at init instead of silently defaulting.
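As an illustration of the `max_grid_size` cost guard, the sketch below counts grid cells before expanding them. `expand_grid`, the grid shape (a dict of value lists), and the `ValueError` are assumptions made for this example; only the default cap of 50 comes from the entry above.

```python
from itertools import product


def expand_grid(
    grid: dict[str, list[float]], max_grid_size: int = 50
) -> list[dict[str, float]]:
    """Expand a parameter grid into cells, refusing grids above the cap.

    Hypothetical helper; the check runs before any expensive work starts.
    """
    size = 1
    for values in grid.values():
        size *= len(values)
    if size > max_grid_size:
        raise ValueError(
            f"grid has {size} cells, exceeding max_grid_size={max_grid_size}"
        )
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]
```

Computing the cell count from the list lengths means an oversized grid is rejected without materializing any combinations.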
+
+### Renamed: `skimr_spacy` → `lede_spacy`
+
+The Tier-2 fact-extractor enum value was renamed to match the package's PyPI name (shipped as `lede` + `lede-spacy` 2026-04-28). Active surfaces updated: `PGRGConfig.fact_extractor` Literal, schema comment, user-guide, cookbook. Released migration `002_evolution_tracking.sql` and dated audit-trail specs under `docs/superpowers/` left untouched per project policy.
+
+### CI
+
+- **CI fixed.** `.github/workflows/test.yml` was running `pytest tests/integration/ tests/test_e2e.py -v`, but `tests/test_e2e.py` was moved to `tests/integration/test_e2e.py` in the alpha merge — pytest exited 5 (no tests at the explicit path) on every push since 2026-04-27. Removed the stale path.
+- `benchmarks/age-bakeoff` excluded from root ruff config — separate sub-project with its own `pyproject.toml` and lint posture.
+- All 195 tests passing across `tests/unit/` and `tests/integration/` on Python 3.12 and 3.13.
+
+### Tests
+
+- New `tests/integration/test_error_paths.py` (15 tests) — asserts specific exception types or behaviors on bad DSN, naive `as_of`, oversize `/ingest`, path-traversal filenames, `tune_scoring_weights` cost guard, namespace allowlist, etc.
+- Latency-test thresholds widened from `< 200 ms` to `< 1500 ms` — these were flaking under cold-start CI / contended dev machines despite no real perf regression. Tight perf gating belongs in the dedicated bake-off harness, not in user-journey tests.
+- `test_07_bus_factor` xfail removed — replaced the empirically-flaky directional assertion (`hybrid_score >= naive_score`) with a property-style check (both modes return ≥ 1 expected keyword).
+
+### Docs
+
+- README rewritten with layered structure (what / why / how → weeds).
+- New [`docs/EVOLUTION-API-QUICKREF.md`](docs/EVOLUTION-API-QUICKREF.md) — common assumptions vs reality for the Tier 1 API (which kwargs are per-query vs config-only, how to read evolution columns, `as_of` × `retracted_at` semantics).
+- `docs/user-guide.md` gains "Schema migrations", "Concurrency / sizing", "Logging", and "Graceful shutdown" subsections.
+- README quickstart switched from `pip install pg-raggraph` (not yet on PyPI) to a clone-based install that actually works.
+- `pgrg serve` now carries an explicit "deploy behind auth, do not expose publicly" banner in README + user-guide.
+
+### Dependency / supply-chain
+
+- `pip-audit --skip-editable` is clean: zero CVEs in any direct or transitive dependency.
+
+## 0.3.0-alpha — 2026-04-25
+
+### Added
+
+- **Evolving-knowledge RAG, Tier 1 (Structural).** Opt-in evolution tracking
+  that respects document effective-dates, retractions, and supersession at
+  the document level. Opt in via `PGRGConfig(evolution_tier="structural")`
+  or env `PGRG_EVOLUTION_TIER=structural`.
+- `rag.ingest(metadata={...})` now accepts `effective_from`, `effective_to`,
+  `retracted`, `retracted_at`, `retraction_reason`, `version_label`,
+  `supersedes_document_id`. Per-ingest scope (applies to every file in the
+  call).
+- `rag.query()` new kwargs: `as_of=datetime(...)` time-travel filter,
+  `version_filter="..."` version restriction, `evolution_aware=False`
+  per-call override to force classic retrieval.
+- `rag.tune_scoring_weights(namespace, gold, grid, ...)` grid-search
+  utility for per-corpus weight tuning. Writes the best cell back to
+  `rag.config`.
+- Schema: three new tables (`facts`, `fact_edges`, `document_versions`)
+  and four new columns on `documents` via migration
+  `002_evolution_tracking.sql`. All additive; fact-level tables stay empty
+  at Tier 1.
+- Behavior modes: `retracted_behavior` ∈ {hide, flag, surface_both};
+  `supersession_behavior` ∈ {hide, prefer_new, surface_both}.
+
+### Changed
+
+- `PGRGConfig` gains 15+ fields for evolution tracking. Defaults leave
+  Tier 0 behavior unchanged.
+- Retrieval SQL templates (`naive`, `local`, `global`) are now built
+  per-query from the config rather than stored as string constants. When
+  `evolution_tier="off"`, the generated SQL is semantically identical to
+  the prior version.
+
+### Deferred to future tiers
+
+- Fact-level extraction (Tier 2).
+- LLM-inferred fact edges and contradiction detection (Tier 3).
+- Async slow-path fact-edge inference (Tier 3).
+
+See `docs/cookbook/evolution-tracking.md` for the quickstart.
+
+## 2026-04-20 — `chunk_strategy="hierarchy"` opt-in chunker
+
+### Added
+
+- **`chunk_strategy="hierarchy"`** — heading-prefixed chunker ported from the AGE bake-off (`benchmarks/age-bakeoff/src/age_bakeoff/chunker.py:_split_hierarchy`). Each section body is prefixed with its markdown heading so pgvector embeds `heading+body` as one unit. When a document has no headings, the body is prefixed with a derived title (first H1, else source filename). No token-budget split — sections over `chunk_max_tokens` are passed through unchanged and get truncated at embed time, mirroring the benchmarked behavior byte-for-byte.
+- **When to use it:** corpora with concrete, per-doc disambiguating titles — SCOTUS-style case names ("Miranda v. Arizona"), article titles, product names. On the SCOTUS corpus this cleared DC-003 by 2.5× across all six retrieval modes (`benchmarks/age-bakeoff/results/REPORT-VERDICT.md` §6).
+- **When NOT to use it:** corpora with format-string titles that repeat across docs — meeting updates ("Weekly sync: …"), ticket prefixes, templated status reports. The acme replication on that shape regressed −1 to −2 questions per retrieval mode and tripled hallucinations (`benchmarks/age-bakeoff/results/ACME-HIER-REPLICATION.md`).
+- Default `chunk_strategy` remains `"auto"`. This is an opt-in config, not a behavior change for existing users.
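The heading-prefix idea can be sketched as below. This is a simplified stand-in, not the ported `_split_hierarchy`: the function name, the regex, stripping the `#` markers from the heading, and falling back directly to the source filename are assumptions, and headings inside fenced code are not handled.

```python
import re

# ATX-style markdown headings: one to six '#' followed by text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_hierarchy(text: str, source_name: str) -> list[str]:
    """Split markdown into sections, each prefixed with its heading text."""
    chunks: list[str] = []
    heading: str | None = None
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            # Text with no heading above it falls back to the source name.
            title = heading if heading is not None else source_name
            chunks.append(f"{title}\n\n{content}")

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

As in the entry above, there is no size-based split here: each section becomes one chunk regardless of length.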
+
+## 2026-04-17 — AGE Bake-Off Benchmark (v0.3.1)
+
+### Added
+
+- **AGE vs pg-raggraph bake-off benchmark** (`benchmarks/age-bakeoff/`) — reproducible head-to-head comparison on two corpora (Acme Labs + SCOTUS) measuring retrieval latency, answer quality (LLM judge), and fact recall across 60 gold-labeled questions.
+- **Benchmark results:**
+  - pg-raggraph retrieval is **1.4x faster on Acme** (33ms vs 47ms p50) and **47x faster on SCOTUS** (60ms vs 2,863ms p50)
+  - Answer quality roughly comparable; AGE slightly better on Acme (zero hallucinations vs 3), tied on SCOTUS
+- Full pipeline: shared chunker, engine adapters, runner, fact-recall scorer, LLM judge (gpt-4.1-mini), deterministic report generator
+- **`docs/why-not-apache-age.md`** — user-facing guide distilled from the research doc, now updated with measured bake-off numbers replacing cited third-party benchmarks
+- **70 passing tests** for the benchmark suite (all mocked, no external API calls in test suite)
+- **Docker stack** with both engines side-by-side (pgvector/pg16 on 5434, AGE+pgvector on 5435)
+- **CLI** (`age-bakeoff ingest|run|judge|report`) for one-command reproduction
+
+### Fixed
+
+- Entity INSERT uses `ON CONFLICT` for corpora with duplicate entity names (SCOTUS has duplicate case names)
+- `relationship_chunks` linking scoped to relevant chunks only (was O(R*C) = 3.5M INSERTs for SCOTUS; now O(R*matches))
+- Chunker `_split_plain` hard-splits oversized paragraphs (was silently emitting chunks > MAX_CHARS)
+- `BakeoffConfig` strips/rejects whitespace-only `OPENAI_API_KEY`
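The hard-split fix for oversized paragraphs can be illustrated with a simplified splitter. `split_plain` here is a hypothetical reduction of the idea: it splits on blank lines and slices anything longer than `max_chars`, and does not attempt whatever paragraph packing the real `_split_plain` performs.

```python
def split_plain(text: str, max_chars: int) -> list[str]:
    """Split on blank lines, hard-splitting any paragraph over max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        # Slice into fixed-width pieces so no emitted chunk exceeds the cap;
        # empty paragraphs produce no pieces at all.
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks
```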
+
+### Infrastructure
+
+- Postgres REL_16_5 executor+planner slice cloned via sparse-checkout (116 .c files) for the code corpus (pg-src questions written, extraction pipeline ready, run deferred to next session)
+- Acme seed data: 42 entities, 103 relationships, 160 documents mirrored from graphrag-demo
+- SCOTUS seed data: 416 entities, 4,397 relationships, 772 documents mirrored from graphrag-demo
@@ -0,0 +1,33 @@
+# Code of Conduct
+
+This project adopts the **Contributor Covenant v2.1** as its code of conduct. The canonical text is maintained at:
+
+- <https://www.contributor-covenant.org/version/2/1/code_of_conduct/>
+
+All contributors, maintainers, and participants in project spaces (GitHub issues, pull requests, discussions, any associated chat channels) are expected to follow it.
+
+## Scope
+
+This code applies within all project spaces and also applies when an individual is officially representing the project in public spaces.
+
+## Reporting
+
+Instances of behavior that violates the Contributor Covenant may be reported to the project maintainer at **matt@theyonk.com**. All reports will be reviewed and investigated promptly and fairly. Reporter identity is kept confidential.
+
+## Enforcement
+
+The maintainer is responsible for clarifying and enforcing the standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior they deem inappropriate, threatening, offensive, or harmful.
+
+Enforcement may include:
+
+1. A private warning with an explanation of what needs to change and why.
+2. A public warning on the relevant issue or PR.
+3. A temporary or permanent ban from project spaces.
+
+The severity of the response is guided by the Contributor Covenant's enforcement guidelines at:
+
+- <https://www.contributor-covenant.org/version/2/1/code_of_conduct/#enforcement-guidelines>
+
+## Attribution
+
+This Code of Conduct is adopted from the [Contributor Covenant](https://www.contributor-covenant.org/), version 2.1, available at the URL above. The Contributor Covenant is licensed under CC BY 4.0.
@@ -0,0 +1,68 @@
+# Contributing to pg-raggraph
+
+Thanks for considering a contribution. This is a small, focused library and we want to keep it that way — clear code, honest benchmarks, and a low barrier to reading the whole thing in a sitting.
+
+## What kinds of changes we welcome
+
+- **Bug reports and fixes.** File an issue first for anything non-obvious so we can agree on the shape of the fix.
+- **New chunkers, embedders, or retrieval modes** — ship them behind a config flag, with a test, and a benchmark-table entry showing the tradeoff.
+- **Documentation fixes and clarifications.** Always welcome.
+- **Benchmark extensions.** New corpora, new question sets, new metrics. Evidence is the point.
+
+## What to check before opening a PR
+
+- Tests pass: `uv run pytest`
+- Lint clean: `uv run ruff check . && uv run ruff format --check .`
+- New behavior has a test. New config knobs are documented in `docs/user-guide.md` and the `README.md` config table.
+- Benchmark numbers in the PR description cite a raw result file, not a summary paragraph.
+
+## Local development setup
+
+```bash
+# 1. Clone and enter the repo
+git clone https://github.com/<you>/pg-raggraph.git
+cd pg-raggraph
+
+# 2. Install dependencies (Python 3.12+ required)
+uv sync --all-extras
+
+# 3. Start PostgreSQL with pgvector + pg_trgm
+docker compose up -d
+
+# 4. Copy env examples and fill in keys
+cp .env.example .env
+cp benchmarks/age-bakeoff/.env.example benchmarks/age-bakeoff/.env
+# edit both files — see README for the full config table
+
+# 5. Run the test suite
+uv run pytest                      # all tests (needs DB up)
+uv run pytest tests/unit/          # just unit (no DB needed)
+uv run pytest tests/integration/   # integration (needs DB up)
+```
+
+## Code style
+
+- **Python 3.12+, async-first.** All database operations use `asyncpg` / `psycopg` async.
+- **Ruff for lint + format.** We match `pyproject.toml` settings; don't reformat with a different tool.
+- **Small, focused PRs.** One logical change per PR. Bug fix ≠ refactor ≠ feature; split them.
+- **Comments are rare.** Name things well enough that most comments are unnecessary. When a comment is warranted, explain *why*, not *what*.
+- **Tests tell the story.** If the behavior isn't obvious from a test, the test is unclear.
+
+## Commit and PR expectations
+
+- Commit messages follow the existing repo style; see `git log --oneline -20` for examples. Use a one-line summary that starts with a type prefix (`feat:`, `fix:`, `docs:`, `test:`) and a short imperative phrase, followed by a body explaining *why* the change matters.
+- PR titles should finish the sentence "This PR ...". Bodies should explain the user-facing effect and link any relevant issue.
+- PRs that change benchmark or library defaults must include before/after numbers from a real run.
+- Co-authored-by lines are welcome.
+
+## When in doubt
+
+Open a draft PR early or start a discussion issue. Aligning on approach before you've written a lot of code saves everyone time.
+
+## Code of conduct
+
+This project follows the [Contributor Covenant](CODE_OF_CONDUCT.md). Please read it before opening an issue or PR.
+
+## Security
+
+See [SECURITY.md](SECURITY.md) for how to report vulnerabilities.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 yonk-tools
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.