adaptive-oci-chunking 0.1.0__tar.gz
- adaptive_oci_chunking-0.1.0/.env.example +8 -0
- adaptive_oci_chunking-0.1.0/.github/workflows/ci.yml +21 -0
- adaptive_oci_chunking-0.1.0/.gitignore +13 -0
- adaptive_oci_chunking-0.1.0/Architecture.png +0 -0
- adaptive_oci_chunking-0.1.0/CONTRIBUTING.md +105 -0
- adaptive_oci_chunking-0.1.0/LICENSE +21 -0
- adaptive_oci_chunking-0.1.0/PKG-INFO +343 -0
- adaptive_oci_chunking-0.1.0/README.md +298 -0
- adaptive_oci_chunking-0.1.0/examples/basic_adaptive_chunking.py +36 -0
- adaptive_oci_chunking-0.1.0/examples/custom_selector.py +54 -0
- adaptive_oci_chunking-0.1.0/examples/langchain_integration.py +25 -0
- adaptive_oci_chunking-0.1.0/examples/llama_index_integration.py +23 -0
- adaptive_oci_chunking-0.1.0/examples/oci_object_storage.py +21 -0
- adaptive_oci_chunking-0.1.0/examples/sample.md +12 -0
- adaptive_oci_chunking-0.1.0/pyproject.toml +66 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/__init__.py +5 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/api.py +40 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/chunkers.py +342 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/cli.py +62 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/io.py +24 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/langchain.py +33 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/llama_index.py +74 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/metrics.py +287 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/models.py +61 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/oci.py +73 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/pipeline.py +25 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/selector.py +30 -0
- adaptive_oci_chunking-0.1.0/src/adaptive_chunking/text.py +71 -0
- adaptive_oci_chunking-0.1.0/tests/test_adaptive_chunking.py +186 -0
@@ -0,0 +1,8 @@
+OCI_CONFIG_FILE=~/.oci/config
+OCI_PROFILE=DEFAULT
+OCI_COMPARTMENT_ID=ocid1.compartment.oc1..example
+OCI_GENAI_ENDPOINT=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com
+OCI_GENAI_EMBEDDING_MODEL=cohere.embed-english-v3.0
+OCI_OBJECT_STORAGE_NAMESPACE=my-namespace
+OCI_OBJECT_STORAGE_BUCKET=my-bucket
+
@@ -0,0 +1,21 @@
+name: CI
+
+on:
+  push:
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - run: python -m pip install --upgrade pip
+      - run: pip install -e ".[dev]"
+      - run: ruff check .
+      - run: pytest
Binary file
@@ -0,0 +1,105 @@
+# Contributing
+
+Thanks for helping improve Adaptive OCI Chunking. This project is intended to be useful for practitioners building RAG systems, researchers testing chunking strategies, and maintainers who want a clean place to compare document-splitting ideas.
+
+Maintainer: [Yash Shukla](https://www.linkedin.com/in/yashtechi/), focused on AI, cloud, and RAG systems.
+
+## Ways to Contribute
+
+- Add new chunkers for specific document structures or domains.
+- Improve intrinsic metrics or add new evaluation dimensions.
+- Add examples for LangChain, LlamaIndex, OCI, or other RAG workflows.
+- Improve tests, documentation, type hints, and packaging.
+- Report bugs with small reproducible examples.
+- Share benchmark results from real document collections.
+
+## Development Setup
+
+Clone the repo and install it in editable mode. On Windows:
+
+```bash
+python -m venv .venv
+.venv\Scripts\activate
+pip install -e ".[dev]"
+```
+
+On macOS or Linux:
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]"
+```
+
+If you only want optional integration support:
+
+```bash
+pip install -e ".[langchain,llama-index,oci,api]"
+```
+
+## Running Checks
+
+```bash
+ruff check .
+pytest
+python -m compileall src tests examples
+```
+
+If you add an example, make sure it either runs without optional credentials or clearly documents the required environment variables.
+
+## Adding a Chunker
+
+Chunkers live in `src/adaptive_chunking/chunkers.py` and implement `BaseChunker`.
+
+A good chunker should:
+
+- Preserve source order.
+- Return non-empty `Chunk` objects with stable `start_char` and `end_char` spans.
+- Avoid silently dropping text.
+- Fall back gracefully when its preferred structure is not present.
+- Include focused tests in `tests/test_adaptive_chunking.py`.
+- Be added to `default_chunkers()` only if it is broadly useful.
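The first three expectations in the list above can be checked mechanically. The sketch below is one possible way to do that in a test. It assumes a chunk exposes `text`, `start_char`, and `end_char`; the `Chunk` dataclass here is a hypothetical stand-in, not the package's own model, and `check_chunks` is an illustrative helper that does not exist in the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the package's Chunk model."""

    text: str
    start_char: int
    end_char: int


def check_chunks(source: str, chunks: list[Chunk]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: list[str] = []
    cursor = 0  # end of the source text accounted for so far
    previous_start = -1
    for index, chunk in enumerate(chunks):
        if not chunk.text.strip():
            problems.append(f"chunk {index} is empty")
        if source[chunk.start_char:chunk.end_char] != chunk.text:
            problems.append(f"chunk {index} span does not match its text")
        if chunk.start_char < previous_start:
            problems.append(f"chunk {index} is out of source order")
        # Anything skipped between the covered region and this chunk
        # must be whitespace only, otherwise text was dropped.
        if chunk.start_char > cursor and source[cursor:chunk.start_char].strip():
            problems.append(f"text before chunk {index} was dropped")
        cursor = max(cursor, chunk.end_char)
        previous_start = chunk.start_char
    if source[cursor:].strip():
        problems.append("trailing text was dropped")
    return problems


source = "Alpha beta.\n\nGamma delta."
good = [Chunk("Alpha beta.", 0, 11), Chunk("Gamma delta.", 13, 25)]
bad = [Chunk("Gamma delta.", 13, 25)]  # first paragraph is missing
```

Overlapping chunks pass this check by design, since overlap is a legitimate strategy; only gaps that contain non-whitespace text are reported.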
+
+## Adding a Metric
+
+Metrics live in `src/adaptive_chunking/metrics.py`.
+
+A good metric should:
+
+- Return a bounded score from `0.0` to `1.0`.
+- Be explainable from document and chunk structure alone.
+- Have a default weight in `MetricWeights`.
+- Include an explanation string in `IntrinsicMetricEvaluator.evaluate`.
+- Include tests for normal and edge cases.
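As an illustration of the "bounded score" and "edge cases" points above, here is a minimal sketch of a structure-only metric. The function names, the size limits, and the choice to score an empty chunk list as `0.0` are assumptions made for this example; they are not taken from `metrics.py`.

```python
def clamp01(value: float) -> float:
    """Force a score into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def size_compliance(chunk_texts: list[str], min_size: int = 200, max_size: int = 1800) -> float:
    """Share of chunks whose character length lies within [min_size, max_size]."""
    if not chunk_texts:
        # Edge case: nothing to evaluate, so report the worst score
        # instead of dividing by zero.
        return 0.0
    within = sum(1 for text in chunk_texts if min_size <= len(text) <= max_size)
    return clamp01(within / len(chunk_texts))
```

A ratio like this is already bounded, so the clamp is only a safety net; it matters more for metrics built from differences or penalties that can leave the interval.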
+
+## Pull Request Checklist
+
+Before opening a PR:
+
+- Run `ruff check .`.
+- Run `pytest`.
+- Run `python -m compileall src tests examples`.
+- Update README or examples when behavior changes.
+- Add or update tests for code changes.
+- Keep changes focused on one concern where possible.
+
+## Design Principles
+
+- The core package should stay dependency-light.
+- Optional integrations should import their heavy dependencies only when used.
+- Chunking behavior should be inspectable and explainable.
+- Metrics should help users understand tradeoffs, not hide them behind a black box.
+- OCI support should remain optional.
+
+## Reporting Issues
+
+Please include:
+
+- Python version.
+- Installation command.
+- Minimal input text or document shape.
+- Expected chunking behavior.
+- Actual chunking behavior.
+- Any traceback or metric output.
+
+For private or sensitive documents, replace content with a synthetic example that preserves the relevant structure.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Yash Shukla
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,343 @@
+Metadata-Version: 2.4
+Name: adaptive-oci-chunking
+Version: 0.1.0
+Summary: Adaptive document chunking for RAG with optional Oracle Cloud Infrastructure integrations.
+Project-URL: Repository, https://github.com/CaptnSalazar/adaptive-oci-chunking
+Project-URL: LinkedIn, https://www.linkedin.com/in/yashtechi/
+Author: Yash Shukla
+License-Expression: MIT
+License-File: LICENSE
+Keywords: adaptive-chunking,chunking,document-ai,genai,oci,rag
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Text Processing
+Requires-Python: >=3.10
+Requires-Dist: pydantic>=2.7
+Requires-Dist: rich>=13.7
+Requires-Dist: typer>=0.12
+Provides-Extra: api
+Requires-Dist: fastapi>=0.111; extra == 'api'
+Requires-Dist: uvicorn[standard]>=0.30; extra == 'api'
+Provides-Extra: dev
+Requires-Dist: fastapi>=0.111; extra == 'dev'
+Requires-Dist: langchain-text-splitters>=0.2; extra == 'dev'
+Requires-Dist: llama-index-core>=0.10; extra == 'dev'
+Requires-Dist: mypy>=1.10; extra == 'dev'
+Requires-Dist: oci>=2.130; extra == 'dev'
+Requires-Dist: pytest>=8.0; extra == 'dev'
+Requires-Dist: ruff>=0.5; extra == 'dev'
+Requires-Dist: uvicorn[standard]>=0.30; extra == 'dev'
+Provides-Extra: integrations
+Requires-Dist: langchain-text-splitters>=0.2; extra == 'integrations'
+Requires-Dist: llama-index-core>=0.10; extra == 'integrations'
+Provides-Extra: langchain
+Requires-Dist: langchain-text-splitters>=0.2; extra == 'langchain'
+Provides-Extra: llama-index
+Requires-Dist: llama-index-core>=0.10; extra == 'llama-index'
+Provides-Extra: oci
+Requires-Dist: oci>=2.130; extra == 'oci'
+Description-Content-Type: text/markdown
+
+<div align="center">
+
+# Adaptive OCI Chunking
+
+**Adaptive chunking toolkit for RAG with OCI, LangChain, and LlamaIndex support**
+
+[CI](https://github.com/CaptnSalazar/adaptive-oci-chunking/actions/workflows/ci.yml)
+[License: MIT](LICENSE)
+[Python 3.10+](https://www.python.org/downloads/)
+[arXiv:2603.25333](https://arxiv.org/abs/2603.25333)
+
+</div>
+
+Adaptive OCI Chunking is an extensible Python implementation for document-aware chunk selection in Retrieval-Augmented Generation (RAG). It is inspired by Ekimetrics' `adaptive-chunking` repository and the paper _Adaptive Chunking: Optimizing Chunking-Method Selection for RAG_.
+
+The package evaluates several chunking strategies for each document, scores them with intrinsic metrics, and selects the best candidate before indexing or generation. Oracle Cloud Infrastructure (OCI) integrations are optional: the core chunking engine runs locally, while OCI Object Storage and Generative AI can be enabled when needed.
+
+## Architecture
+
+![Architecture](Architecture.png)
+
+## What is Adaptive Chunking?
+
+No single chunking method works best for every document in a RAG pipeline. Adaptive chunking treats chunking as a selection problem: try multiple splitting strategies, score each result with intrinsic quality metrics, and choose the best candidate for the document at hand.
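To make the "selection problem" framing concrete, here is a toy sketch of the loop: run every candidate splitter, score each result, keep the highest scorer. Everything in it is illustrative. The two splitters, the single `boundary_score` metric, and `select_best` are assumptions made for this example and are far simpler than the package's chunkers and metrics.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def split_fixed(text: str, size: int = 40) -> list[str]:
    """Cut the text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def boundary_score(chunks: list[str]) -> float:
    """Toy metric: share of chunks that end at sentence punctuation."""
    if not chunks:
        return 0.0
    clean = sum(1 for chunk in chunks if chunk.rstrip().endswith((".", "!", "?")))
    return clean / len(chunks)


def select_best(text: str, chunkers: dict[str, Chunker]) -> tuple[str, list[str]]:
    """Run every chunker and return the name and chunks of the top scorer."""
    candidates = {name: chunker(text) for name, chunker in chunkers.items()}
    best = max(candidates, key=lambda name: boundary_score(candidates[name]))
    return best, candidates[best]


text = "First paragraph ends here.\n\nSecond paragraph also ends cleanly."
name, chunks = select_best(text, {"paragraphs": split_paragraphs, "fixed": split_fixed})
```

On this input the paragraph splitter wins because the fixed window cuts the second sentence in half. A different document, for example one long paragraph, could favor a different splitter, which is the point of selecting per document.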
+
+This repo builds on that idea as a practical toolkit. It keeps the core dependency-light, adds extra production-oriented metrics, and includes optional adapters for OCI, LangChain, and LlamaIndex.
+
+## Features
+
+- Candidate chunkers:
+  - single-document
+  - fixed window with overlap
+  - recursive split
+  - split-then-merge
+  - section-aware
+  - delimiter-aware
+  - page-aware
+  - semantic lexical drift
+  - regex-guided section splitting
+- Metric-guided selection using paper-aligned intrinsic metrics:
+  - References Completeness (RC)
+  - Intrachunk Cohesion (ICC)
+  - Document Contextual Coherence (DCC)
+  - Block Integrity (BI)
+  - Size Compliance (SC)
+- Additional practical metrics:
+  - source coverage
+  - overlap control
+  - boundary quality
+  - semantic drift
+  - information density
+  - redundancy
+- Weighted strategy selection with explainable per-metric scores.
+- LangChain `TextSplitter` adapter.
+- LlamaIndex node conversion and parser-style adapter.
+- CLI for local text/Markdown files.
+- Optional OCI Object Storage loader and OCI Generative AI embedding adapter.
+- Small, dependency-light core for local document chunking workflows.
+
+## Contributing
+
+Contributions are welcome for new chunkers, metrics, examples, integrations, benchmarks, documentation, and bug fixes.
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, PR expectations, and guidance for adding chunkers or metrics.
+
+Maintained by [Yash Shukla](https://www.linkedin.com/in/yashtechi/), focused on AI, cloud, and RAG systems.
+
+## Install
+
+```bash
+pip install -e ".[dev]"
+```
+
+With OCI support:
+
+```bash
+pip install -e ".[oci]"
+```
+
+With the API server:
+
+```bash
+pip install -e ".[api]"
+```
+
+With framework integrations:
+
+```bash
+pip install -e ".[langchain,llama-index]"
+```
+
+## Quick Start
+
+```bash
+adaptive-chunk chunk examples/sample.md --json
+```
+
+Python usage:
+
+```python
+from adaptive_chunking import AdaptiveChunker
+
+text = "## Introduction\nAdaptive chunking chooses a splitter per document.\n\n## Details\n..."
+chunker = AdaptiveChunker()
+result = chunker.chunk(text, document_id="demo")
+
+print(result.strategy_name)
+for chunk in result.chunks:
+    print(chunk.text)
+```
+
+## Examples
+
+Runnable examples live in `examples/`:
+
+- `basic_adaptive_chunking.py`: end-to-end adaptive selection with metric output.
+- `custom_selector.py`: custom chunker list and metric weights.
+- `langchain_integration.py`: LangChain `TextSplitter` usage.
+- `llama_index_integration.py`: LlamaIndex `TextNode` conversion.
+- `oci_object_storage.py`: loading source text from OCI Object Storage.
+
+## Chunker Options
+
+```python
+from adaptive_chunking.chunkers import (
+    DelimiterChunker,
+    PageChunker,
+    SectionAwareChunker,
+    SemanticChunker,
+)
+from adaptive_chunking.selector import AdaptiveSelector
+from adaptive_chunking import AdaptiveChunker
+
+selector = AdaptiveSelector(
+    chunkers=[
+        SectionAwareChunker(max_size=1800),
+        DelimiterChunker(delimiter="\n---\n"),
+        PageChunker(page_delimiter="\f"),
+        SemanticChunker(max_size=1400, similarity_threshold=0.08),
+    ]
+)
+
+result = AdaptiveChunker(selector=selector).chunk(text)
+```
+
+## Metrics
+
+The selector ranks every candidate by a weighted average of intrinsic scores. The first five metrics follow the paper's evaluation dimensions; the additional metrics make the implementation more practical for production RAG systems where dropped text, excessive overlap, and duplicated chunks are common failure modes.
+
+Weights can be tuned:
+
+```python
+from adaptive_chunking.metrics import IntrinsicMetricEvaluator, MetricConfig, MetricWeights
+from adaptive_chunking.selector import AdaptiveSelector
+
+weights = MetricWeights(
+    block_integrity=1.4,
+    coverage=1.5,
+    redundancy=0.8,
+)
+evaluator = IntrinsicMetricEvaluator(MetricConfig(weights=weights))
+selector = AdaptiveSelector(evaluator=evaluator)
+```
+
+## Adaptive Scoring
+
+For each document, the selector runs every candidate chunker and evaluates the chunks it produces. Each candidate receives a normalized weighted score:
+
+```text
+score(candidate) = sum(metric_value_i * metric_weight_i) / sum(metric_weight_i)
+```
+
+Where:
+
+- `metric_value_i` is the metric score for a candidate, normalized from `0.0` to `1.0`.
+- `metric_weight_i` controls how important that metric is for selection.
+- Higher scores are better.
+- Candidates are ranked from highest score to lowest score.
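The formula and ranking rule above can be written as a small standalone sketch. It uses plain dictionaries rather than the package's `MetricWeights` and evaluator classes. The strategy names, the `size_compliance` key, and all numbers are made up for illustration.

```python
def weighted_score(values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric values, normalized by the total weight."""
    total_weight = sum(weights[name] for name in values)
    if total_weight <= 0:
        raise ValueError("at least one metric needs a positive weight")
    return sum(values[name] * weights[name] for name in values) / total_weight


# Hypothetical metric values for two candidate strategies.
candidates = {
    "fixed_window": {"coverage": 0.8, "size_compliance": 1.0},
    "section_aware": {"coverage": 1.0, "size_compliance": 0.5},
}
weights = {"coverage": 3.0, "size_compliance": 1.0}

scores = {name: weighted_score(values, weights) for name, values in candidates.items()}
# Highest score first, matching the ranking rule described above.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these weights `section_aware` scores 3.5 / 4.0 and `fixed_window` scores 3.4 / 4.0, so full coverage outweighs the better size compliance. Raising the `size_compliance` weight far enough would reverse the order.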
+
+For example, a domain that cares about preserving source text and section boundaries might emphasize `coverage` and `block_integrity`:
+
+| Metric | Value | Weight | Weighted value |
+|--------|------:|-------:|---------------:|
+| coverage | 1.00 | 1.50 | 1.50 |
+| block_integrity | 0.90 | 1.40 | 1.26 |
+| redundancy | 0.80 | 0.80 | 0.64 |
+
+```text
+score = (1.50 + 1.26 + 0.64) / (1.50 + 1.40 + 0.80)
+      = 3.40 / 3.70
+      = 0.919
+```
+
+You can inspect every candidate, not just the winner:
+
+```python
+from adaptive_chunking import AdaptiveChunker
+
+result = AdaptiveChunker().chunk(text, document_id="demo")
+
+for candidate in result.candidates:
+    print(candidate.strategy_name, round(candidate.score, 3), len(candidate.chunks))
+    for metric in candidate.metrics:
+        print(" ", metric.name, metric.value, "weight=", metric.weight)
+```
+
+This makes the selection process explainable: if a chunker loses, you can see whether it dropped content, produced excessive overlap, cut through structure, or failed a size constraint.
+
+## LangChain
+
+```python
+from adaptive_chunking.langchain import LangChainAdaptiveTextSplitter
+
+splitter = LangChainAdaptiveTextSplitter()
+documents = splitter.create_documents([text])
+```
+
+## LlamaIndex
+
+```python
+from adaptive_chunking import AdaptiveChunker
+from adaptive_chunking.llama_index import result_to_llama_nodes
+
+result = AdaptiveChunker().chunk(text, document_id="policy")
+nodes = result_to_llama_nodes(result)
+```
+
+## OCI Usage
+
+Copy `.env.example` and set the values for your tenancy and compartment. The core library does not require OCI credentials unless you instantiate an OCI adapter.
+
+```python
+from adaptive_chunking.oci import OCIObjectStorageTextLoader
+
+loader = OCIObjectStorageTextLoader(
+    namespace="my-namespace",
+    bucket_name="documents",
+)
+text = loader.load_text("policies/example.md")
+```
+
+## API Server
+
+```bash
+uvicorn adaptive_chunking.api:app --reload
+```
+
+Then post:
+
+```bash
+curl -X POST http://127.0.0.1:8000/chunk \
+  -H "Content-Type: application/json" \
+  -d "{\"text\":\"# Title\nBody text\", \"document_id\":\"demo\"}"
+```
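The same request can be prepared from Python with the standard library, which avoids hand-escaping quotes and newlines in the JSON body. This is a sketch under the assumption that the server accepts exactly the `text` and `document_id` fields used in the curl call. `build_chunk_request` is an illustrative helper, and the response format is not documented here, so the example stops short of interpreting it.

```python
import json
import urllib.request


def build_chunk_request(
    text: str,
    document_id: str,
    base_url: str = "http://127.0.0.1:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /chunk endpoint."""
    # json.dumps takes care of escaping newlines and quotes in the text.
    body = json.dumps({"text": text, "document_id": document_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chunk",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chunk_request("# Title\nBody text", "demo")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```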
+
+## Project Layout
+
+```text
+src/adaptive_chunking/
+  chunkers.py     # candidate splitting strategies
+  metrics.py      # intrinsic metric implementations
+  selector.py     # weighted adaptive strategy selection
+  pipeline.py     # high-level AdaptiveChunker
+  langchain.py    # optional LangChain TextSplitter adapter
+  llama_index.py  # optional LlamaIndex node helpers
+  oci.py          # optional OCI adapters
+  api.py          # optional FastAPI app
+  cli.py          # command line interface
+tests/
+examples/
+```
+
+## Notes
+
+This repo is designed as a clean, extensible foundation rather than a verbatim copy of the reference implementation. The metric implementations are practical approximations intended for engineering use and experimentation. Production RAG deployments should calibrate weights, chunk sizes, and embedding models against their document domains.
+
+## References
+
+- Ekimetrics reference implementation: [ekimetrics/adaptive-chunking](https://github.com/ekimetrics/adaptive-chunking)
+- Paper: [Adaptive Chunking: Optimizing Chunking-Method Selection for RAG](https://arxiv.org/abs/2603.25333)
+
+## Citation
+
+If this project helps your work, please cite the original adaptive chunking paper:
+
+```bibtex
+@inproceedings{demoura2026adaptive,
+  title={Adaptive Chunking: Optimizing Chunking-Method Selection for RAG},
+  author={de Moura Junior, Paulo Roberto and Lelong, Jean and Blangero, Annabelle},
+  booktitle={Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026)},
+  year={2026},
+  url={https://arxiv.org/abs/2603.25333},
+}
+```
+
+## License
+
+This project is licensed under the [MIT License](LICENSE).