raglineage 0.1.1__tar.gz

Files changed (60)
  1. raglineage-0.1.1/CONTRIBUTING.md +57 -0
  2. raglineage-0.1.1/LICENSE +17 -0
  3. raglineage-0.1.1/MANIFEST.in +7 -0
  4. raglineage-0.1.1/PKG-INFO +315 -0
  5. raglineage-0.1.1/README.md +275 -0
  6. raglineage-0.1.1/examples/data/products.csv +6 -0
  7. raglineage-0.1.1/examples/data/sample.txt +15 -0
  8. raglineage-0.1.1/pyproject.toml +84 -0
  9. raglineage-0.1.1/raglineage/__init__.py +6 -0
  10. raglineage-0.1.1/raglineage/api.py +406 -0
  11. raglineage-0.1.1/raglineage/audit/__init__.py +15 -0
  12. raglineage-0.1.1/raglineage/audit/auditor.py +46 -0
  13. raglineage-0.1.1/raglineage/audit/checks.py +98 -0
  14. raglineage-0.1.1/raglineage/cli/__init__.py +1 -0
  15. raglineage-0.1.1/raglineage/cli/main.py +137 -0
  16. raglineage-0.1.1/raglineage/config.py +22 -0
  17. raglineage-0.1.1/raglineage/embedding/__init__.py +13 -0
  18. raglineage-0.1.1/raglineage/embedding/base.py +41 -0
  19. raglineage-0.1.1/raglineage/embedding/local.py +45 -0
  20. raglineage-0.1.1/raglineage/embedding/openai.py +54 -0
  21. raglineage-0.1.1/raglineage/ingest/__init__.py +8 -0
  22. raglineage-0.1.1/raglineage/ingest/auto.py +49 -0
  23. raglineage-0.1.1/raglineage/ingest/base.py +37 -0
  24. raglineage-0.1.1/raglineage/ingest/files.py +60 -0
  25. raglineage-0.1.1/raglineage/ingest/tabular.py +103 -0
  26. raglineage-0.1.1/raglineage/lineage/__init__.py +7 -0
  27. raglineage-0.1.1/raglineage/lineage/diff.py +99 -0
  28. raglineage-0.1.1/raglineage/lineage/graph.py +167 -0
  29. raglineage-0.1.1/raglineage/lineage/versioning.py +159 -0
  30. raglineage-0.1.1/raglineage/retrieval/__init__.py +6 -0
  31. raglineage-0.1.1/raglineage/retrieval/filters.py +74 -0
  32. raglineage-0.1.1/raglineage/retrieval/retriever.py +88 -0
  33. raglineage-0.1.1/raglineage/schemas/__init__.py +16 -0
  34. raglineage-0.1.1/raglineage/schemas/audit.py +48 -0
  35. raglineage-0.1.1/raglineage/schemas/dataset.py +67 -0
  36. raglineage-0.1.1/raglineage/schemas/lineage_node.py +86 -0
  37. raglineage-0.1.1/raglineage/store/__init__.py +7 -0
  38. raglineage-0.1.1/raglineage/store/base.py +59 -0
  39. raglineage-0.1.1/raglineage/store/faiss_store.py +117 -0
  40. raglineage-0.1.1/raglineage/store/mapping.py +76 -0
  41. raglineage-0.1.1/raglineage/transform/__init__.py +15 -0
  42. raglineage-0.1.1/raglineage/transform/base.py +29 -0
  43. raglineage-0.1.1/raglineage/transform/chunkers.py +173 -0
  44. raglineage-0.1.1/raglineage/transform/dedupe.py +45 -0
  45. raglineage-0.1.1/raglineage/transform/normalize.py +57 -0
  46. raglineage-0.1.1/raglineage/utils/__init__.py +14 -0
  47. raglineage-0.1.1/raglineage/utils/hashing.py +36 -0
  48. raglineage-0.1.1/raglineage/utils/io.py +57 -0
  49. raglineage-0.1.1/raglineage/utils/logging.py +28 -0
  50. raglineage-0.1.1/raglineage.egg-info/PKG-INFO +315 -0
  51. raglineage-0.1.1/raglineage.egg-info/SOURCES.txt +58 -0
  52. raglineage-0.1.1/raglineage.egg-info/dependency_links.txt +1 -0
  53. raglineage-0.1.1/raglineage.egg-info/entry_points.txt +2 -0
  54. raglineage-0.1.1/raglineage.egg-info/requires.txt +16 -0
  55. raglineage-0.1.1/raglineage.egg-info/top_level.txt +1 -0
  56. raglineage-0.1.1/setup.cfg +4 -0
  57. raglineage-0.1.1/tests/test_end_to_end_small_dataset.py +43 -0
  58. raglineage-0.1.1/tests/test_graph_diff.py +87 -0
  59. raglineage-0.1.1/tests/test_incremental_update.py +34 -0
  60. raglineage-0.1.1/tests/test_lineage_node.py +43 -0
@@ -0,0 +1,57 @@
1
+ # Contributing to raglineage
2
+
3
+ Thank you for your interest in contributing to raglineage! This document provides guidelines and instructions for contributing.
4
+
5
+ ## Development Setup
6
+
7
+ 1. Fork the repository
8
+ 2. Clone your fork:
9
+ ```bash
10
+ git clone https://github.com/YOUR_USERNAME/raglineage.git
11
+ cd raglineage
12
+ ```
13
+ 3. Install in development mode:
14
+ ```bash
15
+ pip install -e ".[dev]"
16
+ ```
17
+ 4. Run tests to ensure everything works:
18
+ ```bash
19
+ pytest
20
+ ```
21
+
22
+ ## Code Style
23
+
24
+ - Use **type hints** everywhere (Python ≥ 3.10)
25
+ - Follow **PEP 8** style guidelines
26
+ - Use **ruff** for linting (configuration in `pyproject.toml`)
27
+ - Use **pydantic** models for all data schemas
28
+ - Write **docstrings** for all public functions and classes
29
+
30
+ ## Testing
31
+
32
+ - Write tests for all new features
33
+ - Aim for high test coverage
34
+ - Tests should be in `tests/` directory
35
+ - Run tests with: `pytest`
36
+
37
+ ## Pull Request Process
38
+
39
+ 1. Create a feature branch from `main`
40
+ 2. Make your changes
41
+ 3. Add tests for new functionality
42
+ 4. Ensure all tests pass: `pytest`
43
+ 5. Run linting: `ruff check .`
44
+ 6. Update documentation if needed
45
+ 7. Submit a pull request with a clear description
46
+
47
+ ## Commit Messages
48
+
49
+ - Use clear, descriptive commit messages
50
+ - Reference issue numbers if applicable
51
+ - Follow conventional commit format when possible
52
+
53
+ ## Questions?
54
+
55
+ Open an issue on GitHub for questions or discussions.
56
+
57
+ Thank you for contributing!
@@ -0,0 +1,17 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ Copyright 2026 Pranav Motarwar
6
+
7
+ Licensed under the Apache License, Version 2.0 (the "License");
8
+ you may not use this file except in compliance with the License.
9
+ You may obtain a copy of the License at
10
+
11
+ http://www.apache.org/licenses/LICENSE-2.0
12
+
13
+ Unless required by applicable law or agreed to in writing, software
14
+ distributed under the License is distributed on an "AS IS" BASIS,
15
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16
+ See the License for the specific language governing permissions and
17
+ limitations under the License.
@@ -0,0 +1,7 @@
1
+ include README.md
2
+ include LICENSE
3
+ include CONTRIBUTING.md
4
+ include pyproject.toml
5
+ recursive-include examples *
6
+ recursive-exclude * __pycache__
7
+ recursive-exclude * *.py[co]
@@ -0,0 +1,315 @@
1
+ Metadata-Version: 2.4
2
+ Name: raglineage
3
+ Version: 0.1.1
4
+ Summary: Lineage-aware RAG engine for auditable, reproducible, versioned retrieval and answers
5
+ Author-email: Pranav Motarwar <pranav.motarwar@example.com>
6
+ License: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/PranavMotarwar/raglineage
8
+ Project-URL: Documentation, https://github.com/PranavMotarwar/raglineage
9
+ Project-URL: Repository, https://github.com/PranavMotarwar/raglineage
10
+ Project-URL: Issues, https://github.com/PranavMotarwar/raglineage/issues
11
+ Keywords: rag,lineage,provenance,vector-search,nlp,llm
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: License :: OSI Approved :: Apache Software License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
22
+ Requires-Python: >=3.10
23
+ Description-Content-Type: text/markdown
24
+ License-File: LICENSE
25
+ Requires-Dist: pydantic>=2.0.0
26
+ Requires-Dist: networkx>=3.0
27
+ Requires-Dist: faiss-cpu>=1.7.4
28
+ Requires-Dist: sentence-transformers>=2.2.0
29
+ Requires-Dist: typer>=0.9.0
30
+ Requires-Dist: rich>=13.0.0
31
+ Requires-Dist: pyyaml>=6.0
32
+ Provides-Extra: dev
33
+ Requires-Dist: pytest>=7.4.0; extra == "dev"
34
+ Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
35
+ Requires-Dist: ruff>=0.1.0; extra == "dev"
36
+ Requires-Dist: mypy>=1.5.0; extra == "dev"
37
+ Provides-Extra: openai
38
+ Requires-Dist: openai>=1.0.0; extra == "openai"
39
+ Dynamic: license-file
40
+
41
+ # raglineage
42
+
43
+ **Lineage-aware RAG engine for auditable, reproducible, versioned retrieval and answers**
44
+
45
+ [![PyPI version](https://badge.fury.io/py/raglineage.svg)](https://badge.fury.io/py/raglineage)
46
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
47
+ [![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
48
+ [![PyPI downloads](https://img.shields.io/pypi/dm/raglineage.svg)](https://pypi.org/project/raglineage/)
49
+
50
+ ## The Unique Idea
51
+
52
+ Most RAG tools store text chunks and embeddings. They lose provenance and cannot explain answer drift.
53
+
54
+ **raglineage** treats RAG as a data lineage and provenance problem, not just vector search. Every retrievable unit is a **Lineage Node (LN)** with:
55
+
56
+ - Immutable ID and dataset version
57
+ - Precise source reference (file path, page, row, URL, etc.)
58
+ - Full transform chain (ordered list of transforms applied)
59
+ - Content hash for integrity
60
+ - Timestamps for auditing
61
+
62
+ The system maintains a **Lineage Graph (DAG)** linking nodes through structural and semantic relationships, enabling:
63
+
64
+ - Dataset versioning and diffing
65
+ - Incremental rebuilds (only recompute what changed)
66
+ - Answer auditing (reconstruct provenance of any answer)
67
+ - Version consistency checks
68
+ - Staleness detection
69
+
70
+ This is **not** a LangChain/LlamaIndex wrapper—it's a first-class lineage system.
71
+
72
+ ## Architecture
73
+
74
+ ```
75
+ ┌─────────────────────────────────────────────────────────────┐
76
+ │ Data Sources │
77
+ │ (PDFs, CSVs, JSON, APIs, Text Files) │
78
+ └──────────────────────┬──────────────────────────────────────┘
79
+
80
+
81
+ ┌─────────────────────────────────────────────────────────────┐
82
+ │ Ingestion Layer │
83
+ │ AutoIngestor → FileIngestor → TabularIngestor │
84
+ └──────────────────────┬──────────────────────────────────────┘
85
+
86
+
87
+ ┌─────────────────────────────────────────────────────────────┐
88
+ │ Transform Layer │
89
+ │ Chunkers → Dedupe → Normalize │
90
+ │ (Each transform recorded in transform_chain) │
91
+ └──────────────────────┬──────────────────────────────────────┘
92
+
93
+
94
+ ┌─────────────────────────────────────────────────────────────┐
95
+ │ Lineage Node Creation │
96
+ │ ln_id, source, transform_chain, content_hash, version │
97
+ └──────────────────────┬──────────────────────────────────────┘
98
+
99
+
100
+ ┌─────────────────────────────────────────────────────────────┐
101
+ │ Lineage Graph (DAG) │
102
+ │ networkx DAG: nodes=LN, edges=relationships │
103
+ └──────────────────────┬──────────────────────────────────────┘
104
+
105
+
106
+ ┌─────────────────────────────────────────────────────────────┐
107
+ │ Embedding + Vector Store │
108
+ │ Embeddings → FAISS Store → LN ID Mapping │
109
+ └──────────────────────┬──────────────────────────────────────┘
110
+
111
+
112
+ ┌─────────────────────────────────────────────────────────────┐
113
+ │ Retrieval + Audit │
114
+ │ Query → Top-K → Graph Walk → Answer + Lineage │
115
+ │ Audit → Version Check → Staleness → Risk Flags │
116
+ └─────────────────────────────────────────────────────────────┘
117
+ ```
118
+
119
+ ## Lineage Node Example
120
+
121
+ Every retrievable chunk is a Lineage Node with complete provenance:
122
+
123
+ ```json
124
+ {
125
+ "ln_id": "ln_92af",
126
+ "content": "Revenue declined due to supply constraints",
127
+ "source": {
128
+ "type": "pdf",
129
+ "uri": "data/10Q_Q3_2023.pdf",
130
+ "page": 14,
131
+ "section": "Management Discussion"
132
+ },
133
+ "dataset_version": "v3.1",
134
+ "transform_chain": [
135
+ "pdf_parse",
136
+ "section_split",
137
+ "semantic_chunk",
138
+ "deduplicate"
139
+ ],
140
+ "content_hash": "sha256:a3f5b8c9d2e1f4a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0",
141
+ "created_at": "2026-01-20T00:00:00Z"
142
+ }
143
+ ```
144
+
145
+ ## Audited Answer Example
146
+
147
+ Every answer includes full lineage and audit metadata:
148
+
149
+ ```json
150
+ {
151
+ "question": "Why did revenue fall in Q3?",
152
+ "answer": "Revenue declined primarily due to supply constraints affecting shipments.",
153
+ "lineage": [
154
+ {
155
+ "ln_id": "ln_92af",
156
+ "score": 0.91,
157
+ "source": {
158
+ "uri": "data/10Q_Q3_2023.pdf",
159
+ "page": 14
160
+ },
161
+ "dataset_version": "v3.1",
162
+ "transform_chain": ["pdf_parse","section_split","semantic_chunk","deduplicate"]
163
+ }
164
+ ],
165
+ "audit": {
166
+ "staleness_check": "pass",
167
+ "version_consistency": "single_version",
168
+ "transform_risk_flags": []
169
+ }
170
+ }
171
+ ```
172
+
173
+ ## Quickstart
174
+
175
+ ### Installation
176
+
177
+ ```bash
178
+ pip install raglineage
179
+ ```
180
+
181
+ ### Basic Usage
182
+
183
+ ```python
184
+ from raglineage import RagLineage
185
+
186
+ rag = RagLineage(
187
+     source="examples/data",
188
+     store_backend="faiss",
189
+     embed_backend="local"
190
+ )
191
+
192
+ # Build initial version
193
+ rag.build(version="v1.0")
194
+
195
+ # Query with lineage
196
+ ans = rag.query("What is the refund policy?", k=5)
197
+ print(ans.model_dump_json(indent=2))
198
+
199
+ # Audit the answer
200
+ report = rag.audit(ans)
201
+ print(report.model_dump_json(indent=2))
202
+ ```
203
+
204
+ ### CLI Usage
205
+
206
+ ```bash
207
+ # Initialize a project
208
+ raglineage init ./my_project
209
+
210
+ # Build from source
211
+ raglineage build --source ./data --version v1.0
212
+
213
+ # Update incrementally
214
+ raglineage update --source ./data --version v1.1 --changed-only
215
+
216
+ # Query
217
+ raglineage query "What is the refund policy?" --k 5
218
+
219
+ # Diff versions
220
+ raglineage diff v1.0 v1.1
221
+ ```
222
+
223
+ ## Comparison with Other RAG Tools
224
+
225
+ | Feature | raglineage | LangChain | LlamaIndex |
226
+ |---------|-----------|-----------|------------|
227
+ | **Lineage Tracking** | First-class | Not built-in | Not built-in |
228
+ | **Dataset Versioning** | Native | Manual | Manual |
229
+ | **Incremental Updates** | Automatic | Full rebuild | Full rebuild |
230
+ | **Answer Auditing** | Built-in | Manual | Manual |
231
+ | **Transform Chain Tracking** | Every LN | Not tracked | Not tracked |
232
+ | **Version Diffing** | Structured | Not available | Not available |
233
+ | **Graph Relationships** | DAG-based | Optional | Optional |
234
+ | **Source Provenance** | Complete | Basic | Basic |
235
+
236
+ **Key Difference**: raglineage treats lineage as a core requirement, not an afterthought. Every operation preserves and tracks provenance.
237
+
238
+ ## Core Concepts
239
+
240
+ ### Lineage Nodes (LN)
241
+
242
+ A Lineage Node is the atomic unit of retrieval. Each LN has:
243
+ - **ln_id**: Stable, deterministic identifier
244
+ - **content**: The actual text content
245
+ - **source**: Precise reference to origin (file, page, row, etc.)
246
+ - **dataset_version**: Version tag for the dataset
247
+ - **transform_chain**: Ordered list of transforms applied
248
+ - **content_hash**: SHA-256 hash for integrity
249
+ - **timestamps**: Created/updated timestamps
250
+
251
+ ### Lineage Graph
252
+
253
+ A directed acyclic graph (DAG) where:
254
+ - **Nodes**: Lineage Node IDs
255
+ - **Edges**: Typed relationships (adjacent, semantic, references, same_entity, etc.)
256
+
257
+ Enables graph-walk retrieval and relationship exploration.
258
+
259
+ ### Dataset Versioning
260
+
261
+ Each dataset build produces a versioned manifest:
262
+ - Tracks all source files and their hashes
263
+ - Enables diffing between versions
264
+ - Supports incremental updates (only recompute changed files)
265
+
266
+ ### Answer Auditing
267
+
268
+ Every answer includes:
269
+ - **Lineage**: List of LNs used with scores and metadata
270
+ - **Audit Report**:
271
+   - Version consistency check
272
+   - Staleness detection
273
+   - Transform risk flags
274
+
275
+ ## Requirements
276
+
277
+ - Python ≥ 3.10
278
+ - Strict type hints throughout
279
+ - Pydantic models for schemas
280
+ - NetworkX for graph operations
281
+ - FAISS for vector storage
282
+ - Sentence-transformers for local embeddings
283
+
284
+ ## Development
285
+
286
+ ```bash
287
+ # Clone repository
288
+ git clone https://github.com/PranavMotarwar/raglineage.git
289
+ cd raglineage
290
+
291
+ # Install in development mode
292
+ pip install -e ".[dev]"
293
+
294
+ # Run tests
295
+ pytest
296
+
297
+ # Run linting
298
+ ruff check .
299
+ ```
300
+
301
+ ## Contributing
302
+
303
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
304
+
305
+ ## License
306
+
307
+ Apache-2.0 License. See [LICENSE](LICENSE) for details.
308
+
309
+ ## Author
310
+
311
+ Pranav Motarwar - [GitHub](https://github.com/PranavMotarwar)
312
+
313
+ ---
314
+
315
+ **raglineage** - Where every answer has a traceable origin.
@@ -0,0 +1,275 @@
1
+ # raglineage
2
+
3
+ **Lineage-aware RAG engine for auditable, reproducible, versioned retrieval and answers**
4
+
5
+ [![PyPI version](https://badge.fury.io/py/raglineage.svg)](https://badge.fury.io/py/raglineage)
6
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
7
+ [![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
8
+ [![PyPI downloads](https://img.shields.io/pypi/dm/raglineage.svg)](https://pypi.org/project/raglineage/)
9
+
10
+ ## The Unique Idea
11
+
12
+ Most RAG tools store text chunks and embeddings. They lose provenance and cannot explain answer drift.
13
+
14
+ **raglineage** treats RAG as a data lineage and provenance problem, not just vector search. Every retrievable unit is a **Lineage Node (LN)** with:
15
+
16
+ - Immutable ID and dataset version
17
+ - Precise source reference (file path, page, row, URL, etc.)
18
+ - Full transform chain (ordered list of transforms applied)
19
+ - Content hash for integrity
20
+ - Timestamps for auditing
21
+
22
+ The system maintains a **Lineage Graph (DAG)** linking nodes through structural and semantic relationships, enabling:
23
+
24
+ - Dataset versioning and diffing
25
+ - Incremental rebuilds (only recompute what changed)
26
+ - Answer auditing (reconstruct provenance of any answer)
27
+ - Version consistency checks
28
+ - Staleness detection
29
+
30
+ This is **not** a LangChain/LlamaIndex wrapper—it's a first-class lineage system.
31
+
32
+ ## Architecture
33
+
34
+ ```
35
+ ┌─────────────────────────────────────────────────────────────┐
36
+ │ Data Sources │
37
+ │ (PDFs, CSVs, JSON, APIs, Text Files) │
38
+ └──────────────────────┬──────────────────────────────────────┘
39
+
40
+
41
+ ┌─────────────────────────────────────────────────────────────┐
42
+ │ Ingestion Layer │
43
+ │ AutoIngestor → FileIngestor → TabularIngestor │
44
+ └──────────────────────┬──────────────────────────────────────┘
45
+
46
+
47
+ ┌─────────────────────────────────────────────────────────────┐
48
+ │ Transform Layer │
49
+ │ Chunkers → Dedupe → Normalize │
50
+ │ (Each transform recorded in transform_chain) │
51
+ └──────────────────────┬──────────────────────────────────────┘
52
+
53
+
54
+ ┌─────────────────────────────────────────────────────────────┐
55
+ │ Lineage Node Creation │
56
+ │ ln_id, source, transform_chain, content_hash, version │
57
+ └──────────────────────┬──────────────────────────────────────┘
58
+
59
+
60
+ ┌─────────────────────────────────────────────────────────────┐
61
+ │ Lineage Graph (DAG) │
62
+ │ networkx DAG: nodes=LN, edges=relationships │
63
+ └──────────────────────┬──────────────────────────────────────┘
64
+
65
+
66
+ ┌─────────────────────────────────────────────────────────────┐
67
+ │ Embedding + Vector Store │
68
+ │ Embeddings → FAISS Store → LN ID Mapping │
69
+ └──────────────────────┬──────────────────────────────────────┘
70
+
71
+
72
+ ┌─────────────────────────────────────────────────────────────┐
73
+ │ Retrieval + Audit │
74
+ │ Query → Top-K → Graph Walk → Answer + Lineage │
75
+ │ Audit → Version Check → Staleness → Risk Flags │
76
+ └─────────────────────────────────────────────────────────────┘
77
+ ```
78
+
79
+ ## Lineage Node Example
80
+
81
+ Every retrievable chunk is a Lineage Node with complete provenance:
82
+
83
+ ```json
84
+ {
85
+ "ln_id": "ln_92af",
86
+ "content": "Revenue declined due to supply constraints",
87
+ "source": {
88
+ "type": "pdf",
89
+ "uri": "data/10Q_Q3_2023.pdf",
90
+ "page": 14,
91
+ "section": "Management Discussion"
92
+ },
93
+ "dataset_version": "v3.1",
94
+ "transform_chain": [
95
+ "pdf_parse",
96
+ "section_split",
97
+ "semantic_chunk",
98
+ "deduplicate"
99
+ ],
100
+ "content_hash": "sha256:a3f5b8c9d2e1f4a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0",
101
+ "created_at": "2026-01-20T00:00:00Z"
102
+ }
103
+ ```
104
+
105
+ ## Audited Answer Example
106
+
107
+ Every answer includes full lineage and audit metadata:
108
+
109
+ ```json
110
+ {
111
+ "question": "Why did revenue fall in Q3?",
112
+ "answer": "Revenue declined primarily due to supply constraints affecting shipments.",
113
+ "lineage": [
114
+ {
115
+ "ln_id": "ln_92af",
116
+ "score": 0.91,
117
+ "source": {
118
+ "uri": "data/10Q_Q3_2023.pdf",
119
+ "page": 14
120
+ },
121
+ "dataset_version": "v3.1",
122
+ "transform_chain": ["pdf_parse","section_split","semantic_chunk","deduplicate"]
123
+ }
124
+ ],
125
+ "audit": {
126
+ "staleness_check": "pass",
127
+ "version_consistency": "single_version",
128
+ "transform_risk_flags": []
129
+ }
130
+ }
131
+ ```
132
+
133
+ ## Quickstart
134
+
135
+ ### Installation
136
+
137
+ ```bash
138
+ pip install raglineage
139
+ ```
140
+
141
+ ### Basic Usage
142
+
143
+ ```python
144
+ from raglineage import RagLineage
145
+
146
+ rag = RagLineage(
147
+     source="examples/data",
148
+     store_backend="faiss",
149
+     embed_backend="local"
150
+ )
151
+
152
+ # Build initial version
153
+ rag.build(version="v1.0")
154
+
155
+ # Query with lineage
156
+ ans = rag.query("What is the refund policy?", k=5)
157
+ print(ans.model_dump_json(indent=2))
158
+
159
+ # Audit the answer
160
+ report = rag.audit(ans)
161
+ print(report.model_dump_json(indent=2))
162
+ ```
163
+
164
+ ### CLI Usage
165
+
166
+ ```bash
167
+ # Initialize a project
168
+ raglineage init ./my_project
169
+
170
+ # Build from source
171
+ raglineage build --source ./data --version v1.0
172
+
173
+ # Update incrementally
174
+ raglineage update --source ./data --version v1.1 --changed-only
175
+
176
+ # Query
177
+ raglineage query "What is the refund policy?" --k 5
178
+
179
+ # Diff versions
180
+ raglineage diff v1.0 v1.1
181
+ ```
182
+
183
+ ## Comparison with Other RAG Tools
184
+
185
+ | Feature | raglineage | LangChain | LlamaIndex |
186
+ |---------|-----------|-----------|------------|
187
+ | **Lineage Tracking** | First-class | Not built-in | Not built-in |
188
+ | **Dataset Versioning** | Native | Manual | Manual |
189
+ | **Incremental Updates** | Automatic | Full rebuild | Full rebuild |
190
+ | **Answer Auditing** | Built-in | Manual | Manual |
191
+ | **Transform Chain Tracking** | Every LN | Not tracked | Not tracked |
192
+ | **Version Diffing** | Structured | Not available | Not available |
193
+ | **Graph Relationships** | DAG-based | Optional | Optional |
194
+ | **Source Provenance** | Complete | Basic | Basic |
195
+
196
+ **Key Difference**: raglineage treats lineage as a core requirement, not an afterthought. Every operation preserves and tracks provenance.
197
+
198
+ ## Core Concepts
199
+
200
+ ### Lineage Nodes (LN)
201
+
202
+ A Lineage Node is the atomic unit of retrieval. Each LN has:
203
+ - **ln_id**: Stable, deterministic identifier
204
+ - **content**: The actual text content
205
+ - **source**: Precise reference to origin (file, page, row, etc.)
206
+ - **dataset_version**: Version tag for the dataset
207
+ - **transform_chain**: Ordered list of transforms applied
208
+ - **content_hash**: SHA-256 hash for integrity
209
+ - **timestamps**: Created/updated timestamps
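
The README describes `ln_id` as stable and deterministic and `content_hash` as a SHA-256 digest, but this part of the diff does not show how either is computed. The following is a minimal sketch of one way such values could be derived using only the standard library. The function names `content_hash` and `make_ln_id`, the choice of inputs to the ID, and the 8-character truncation are assumptions for illustration, not the package's actual scheme.

```python
import hashlib
import json


def content_hash(text: str) -> str:
    """Return a SHA-256 digest of the text, prefixed as in the README example."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_ln_id(source_uri: str, transform_chain: list[str], text: str) -> str:
    """Derive an ID from source, transforms, and content (hypothetical scheme)."""
    # Serialising with sorted keys keeps the payload byte-identical across runs.
    payload = json.dumps(
        {"uri": source_uri, "chain": transform_chain, "hash": content_hash(text)},
        sort_keys=True,
    )
    return "ln_" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


node_id = make_ln_id("data/sample.txt", ["chunk", "normalize"], "Refund Policy")
```

Because the ID depends only on its inputs, rebuilding an unchanged source with the same transforms yields the same `node_id`, while any change to the content or the transform chain yields a different one.
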
210
+
211
+ ### Lineage Graph
212
+
213
+ A directed acyclic graph (DAG) where:
214
+ - **Nodes**: Lineage Node IDs
215
+ - **Edges**: Typed relationships (adjacent, semantic, references, same_entity, etc.)
216
+
217
+ Enables graph-walk retrieval and relationship exploration.
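
To make "graph-walk retrieval" concrete, here is a rough sketch of expanding a set of top-k hits along typed edges. It uses a plain dictionary in place of the networkx graph the package depends on, so it runs with the standard library alone. The adjacency data, the `graph_walk` name, and the `max_hops` and `allowed` parameters are all hypothetical.

```python
from collections import deque

# Hypothetical adjacency: ln_id -> list of (neighbor ln_id, edge type).
EDGES: dict[str, list[tuple[str, str]]] = {
    "ln_a": [("ln_b", "adjacent"), ("ln_c", "semantic")],
    "ln_b": [("ln_d", "adjacent")],
    "ln_c": [],
    "ln_d": [],
}


def graph_walk(seeds: list[str], max_hops: int, allowed: set[str]) -> list[str]:
    """Breadth-first expansion from seed nodes along the allowed edge types."""
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor, edge_type in EDGES.get(node, []):
            if edge_type in allowed and neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, hops + 1))
    return order


expanded = graph_walk(["ln_a"], max_hops=1, allowed={"adjacent"})
```

Restricting the walk by edge type and hop count keeps the expanded context small and makes it easy to record which relationship pulled each extra node into an answer.
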
218
+
219
+ ### Dataset Versioning
220
+
221
+ Each dataset build produces a versioned manifest:
222
+ - Tracks all source files and their hashes
223
+ - Enables diffing between versions
224
+ - Supports incremental updates (only recompute changed files)
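
The manifest idea above can be sketched as a mapping from source path to file hash, with a diff that classifies paths as added, removed, or changed. This is an illustration under assumed data shapes; the `diff_manifests` function, the manifest layout, and the placeholder hash strings are not taken from the package.

```python
def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two path -> hash manifests and classify every differing path."""
    added = sorted(path for path in new if path not in old)
    removed = sorted(path for path in old if path not in new)
    changed = sorted(path for path in new if path in old and old[path] != new[path])
    return {"added": added, "removed": removed, "changed": changed}


# Placeholder hashes; a real manifest would hold full digests.
v1 = {"data/sample.txt": "h1", "data/products.csv": "h2"}
v2 = {"data/sample.txt": "h1b", "data/faq.txt": "h3"}

delta = diff_manifests(v1, v2)
# Only new or modified files need to be re-ingested in an incremental update.
to_rebuild = delta["added"] + delta["changed"]
```

Files that appear in neither `added` nor `changed` can keep their existing nodes and embeddings, which is what makes an incremental rebuild cheaper than a full one.
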
225
+
226
+ ### Answer Auditing
227
+
228
+ Every answer includes:
229
+ - **Lineage**: List of LNs used with scores and metadata
230
+ - **Audit Report**:
231
+   - Version consistency check
231
+   - Staleness detection
232
+   - Transform risk flags
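
As an illustration of the version consistency check, the sketch below inspects the `dataset_version` field of each lineage entry, in the shape shown in the Audited Answer Example. The `"single_version"` label appears in that example; the `"mixed_versions"` and `"no_lineage"` labels and the function itself are assumptions, not the package's audit implementation.

```python
def version_consistency(lineage: list[dict]) -> str:
    """Report whether all lineage entries come from one dataset version."""
    versions = {entry["dataset_version"] for entry in lineage}
    if not versions:
        return "no_lineage"  # assumed label for an answer with no sources
    if len(versions) == 1:
        return "single_version"
    return "mixed_versions"  # assumed label; signals a possible drift risk


lineage = [
    {"ln_id": "ln_92af", "dataset_version": "v3.1"},
    {"ln_id": "ln_17c0", "dataset_version": "v3.1"},
]
status = version_consistency(lineage)
```

An answer assembled from nodes of different dataset versions may combine facts that were never valid at the same time, which is why a check of this kind belongs in the audit report.
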
234
+
235
+ ## Requirements
236
+
237
+ - Python ≥ 3.10
238
+ - Strict type hints throughout
239
+ - Pydantic models for schemas
240
+ - NetworkX for graph operations
241
+ - FAISS for vector storage
242
+ - Sentence-transformers for local embeddings
243
+
244
+ ## Development
245
+
246
+ ```bash
247
+ # Clone repository
248
+ git clone https://github.com/PranavMotarwar/raglineage.git
249
+ cd raglineage
250
+
251
+ # Install in development mode
252
+ pip install -e ".[dev]"
253
+
254
+ # Run tests
255
+ pytest
256
+
257
+ # Run linting
258
+ ruff check .
259
+ ```
260
+
261
+ ## Contributing
262
+
263
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
264
+
265
+ ## License
266
+
267
+ Apache-2.0 License. See [LICENSE](LICENSE) for details.
268
+
269
+ ## Author
270
+
271
+ Pranav Motarwar - [GitHub](https://github.com/PranavMotarwar)
272
+
273
+ ---
274
+
275
+ **raglineage** - Where every answer has a traceable origin.
@@ -0,0 +1,6 @@
1
+ product_id,name,price,category
2
+ 1,Widget A,29.99,Electronics
3
+ 2,Widget B,49.99,Electronics
4
+ 3,Gadget X,19.99,Accessories
5
+ 4,Gadget Y,39.99,Accessories
6
+ 5,Tool Z,59.99,Tools
@@ -0,0 +1,15 @@
1
+ Refund Policy
2
+
3
+ Our refund policy is straightforward. Customers can request a refund within 30 days of purchase.
4
+ Refunds are processed within 5-7 business days. To request a refund, please contact our support team
5
+ with your order number and reason for the refund.
6
+
7
+ Shipping Policy
8
+
9
+ We offer free shipping on orders over $50. Standard shipping takes 5-7 business days.
10
+ Express shipping is available for an additional fee and takes 2-3 business days.
11
+
12
+ Privacy Policy
13
+
14
+ We respect your privacy. We do not sell your personal information to third parties.
15
+ Your data is encrypted and stored securely. You can request deletion of your data at any time.