nzachapy 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- nzachapy-0.1.0/LICENSE +21 -0
- nzachapy-0.1.0/PKG-INFO +124 -0
- nzachapy-0.1.0/README.md +109 -0
- nzachapy-0.1.0/nzachapy/__init__.py +4 -0
- nzachapy-0.1.0/nzachapy/core.py +57 -0
- nzachapy-0.1.0/nzachapy/sementic.py +38 -0
- nzachapy-0.1.0/nzachapy.egg-info/PKG-INFO +124 -0
- nzachapy-0.1.0/nzachapy.egg-info/SOURCES.txt +12 -0
- nzachapy-0.1.0/nzachapy.egg-info/dependency_links.txt +1 -0
- nzachapy-0.1.0/nzachapy.egg-info/requires.txt +4 -0
- nzachapy-0.1.0/nzachapy.egg-info/top_level.txt +1 -0
- nzachapy-0.1.0/pyproject.toml +23 -0
- nzachapy-0.1.0/setup.cfg +4 -0
- nzachapy-0.1.0/tests/test.py +58 -0
nzachapy-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Leonard's Venture
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
nzachapy-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,124 @@
+Metadata-Version: 2.4
+Name: nzachapy
+Version: 0.1.0
+Summary: Context filtering module that scans your data and returns only relevant portions based on a query.
+Author: Leonard Nwosu
+Project-URL: Homepage, https://github.com/Dblvcksheep/nzachapy
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy
+Requires-Dist: sentence-transformers
+Requires-Dist: openai
+Requires-Dist: httpx
+Dynamic: license-file
+
+# nzachapy
+
+> **Context Filtering for LLMs** — Scan your entire dataset and return only the portion relevant to your query.
+
+When working with large datasets and language models, stuffing everything into the context window is noisy and expensive. `nzachapy` solves this by chunking your data, embedding it semantically, and returning only the chunks most relevant to a given query — so your model sees signal, not noise.
+
+---
+
+## Installation
+
+```bash
+pip install nzachapy
+```
+
+---
+
+## Quick Start
+
+```python
+from nzachapy import Nzacha
+
+# Initialize the module
+nz = Nzacha()
+
+# Add your dataset (must be a string)
+nz.add_by_words(data)
+
+# Query for relevant context
+results = nz.search("your query here")
+```
+
+> ⚠️ **Important:** Your dataset must be converted to a string before passing it to `add_by_words()`.
+
+---
+
+## Configuration
+
+When initializing `Nzacha`, you can customize the following parameters. If left unset, the defaults are used.
+
+| Parameter | Default | Description |
+|---|---|---|
+| `chunk_size` | `200` | Number of words per chunk |
+| `overlap` | `20` | Number of overlapping words between consecutive chunks |
+| `openai_api_key` | `None` | OpenAI API key for semantic embeddings (optional) |
+| `threshold` | `0.6` | Minimum similarity score for a chunk to be returned |
+
+### Example with custom config
+
+```python
+nz = Nzacha(
+    chunk_size=150,
+    overlap=30,
+    openai_api_key="sk-...",
+    threshold=0.75
+)
+```
+
+---
+
+## Embedding Backends
+
+`nzachapy` supports two embedding strategies:
+
+- **OpenAI** — Provide your `openai_api_key` at initialization to use OpenAI's embedding models for higher-quality semantic search.
+- **Sentence Transformers** *(default)* — If no API key is provided, the module falls back to a local `sentence-transformers` model. No API key required.
+
+---
+
+## How It Works
+
+1. **Chunking** — `add_by_words(data)` splits your string dataset into overlapping word-based chunks. `chunk_size` controls how many words each chunk contains; `overlap` controls how many words are shared between adjacent chunks to preserve context across boundaries.
+
+2. **Embedding** — Each chunk is embedded into a vector using either OpenAI or Sentence Transformers.
+
+3. **Retrieval** — When you run a query, it is embedded the same way and compared against all chunk embeddings. Only chunks that meet or exceed the `threshold` similarity score are returned.
+
+---
+
+## API Reference
+
+### `Nzacha(chunk_size, overlap, openai_api_key, threshold)`
+Initializes the context filter with the given configuration.
+
+### `nz.add_by_words(data: str)`
+Chunks the provided string dataset by word count and stores the embeddings internally. Must be called before querying.
+
+- `data` — Your full dataset as a **string**. Convert lists, dicts, DataFrames, or any other structure to a string first.
+
+---
+
+## Notes
+
+- Always pass data as a plain string. Use `str()`, `json.dumps()`, or `.to_string()` (for DataFrames) to convert before calling `add_by_words()`.
+- A lower `threshold` returns more (but less precise) chunks. A higher threshold is stricter.
+- Chunk `overlap` helps avoid losing context that falls on chunk boundaries — tune it based on your data's structure.
+
+---
+
+## Contributing
+
+Contributions are welcome! If you spot an area for improvement — whether it's a new chunking strategy, a better embedding backend, performance optimizations, or additional retrieval methods — feel free to open an issue or submit a pull request.
+
+Please make sure any changes are well-tested and clearly documented.
+
+---
+
+## License
+
+MIT
nzachapy-0.1.0/README.md
ADDED
@@ -0,0 +1,109 @@
+# nzachapy
+
+> **Context Filtering for LLMs** — Scan your entire dataset and return only the portion relevant to your query.
+
+When working with large datasets and language models, stuffing everything into the context window is noisy and expensive. `nzachapy` solves this by chunking your data, embedding it semantically, and returning only the chunks most relevant to a given query — so your model sees signal, not noise.
+
+---
+
+## Installation
+
+```bash
+pip install nzachapy
+```
+
+---
+
+## Quick Start
+
+```python
+from nzachapy import Nzacha
+
+# Initialize the module
+nz = Nzacha()
+
+# Add your dataset (must be a string)
+nz.add_by_words(data)
+
+# Query for relevant context
+results = nz.search("your query here")
+```
+
+> ⚠️ **Important:** Your dataset must be converted to a string before passing it to `add_by_words()`.
+
+---
+
+## Configuration
+
+When initializing `Nzacha`, you can customize the following parameters. If left unset, the defaults are used.
+
+| Parameter | Default | Description |
+|---|---|---|
+| `chunk_size` | `200` | Number of words per chunk |
+| `overlap` | `20` | Number of overlapping words between consecutive chunks |
+| `openai_api_key` | `None` | OpenAI API key for semantic embeddings (optional) |
+| `threshold` | `0.6` | Minimum similarity score for a chunk to be returned |
+
+### Example with custom config
+
+```python
+nz = Nzacha(
+    chunk_size=150,
+    overlap=30,
+    openai_api_key="sk-...",
+    threshold=0.75
+)
+```
+
+---
+
+## Embedding Backends
+
+`nzachapy` supports two embedding strategies:
+
+- **OpenAI** — Provide your `openai_api_key` at initialization to use OpenAI's embedding models for higher-quality semantic search.
+- **Sentence Transformers** *(default)* — If no API key is provided, the module falls back to a local `sentence-transformers` model. No API key required.
+
+---
+
+## How It Works
+
+1. **Chunking** — `add_by_words(data)` splits your string dataset into overlapping word-based chunks. `chunk_size` controls how many words each chunk contains; `overlap` controls how many words are shared between adjacent chunks to preserve context across boundaries.
+
+2. **Embedding** — Each chunk is embedded into a vector using either OpenAI or Sentence Transformers.
+
+3. **Retrieval** — When you run a query, it is embedded the same way and compared against all chunk embeddings. Only chunks that meet or exceed the `threshold` similarity score are returned.
+
+---
+
+## API Reference
+
+### `Nzacha(chunk_size, overlap, openai_api_key, threshold)`
+Initializes the context filter with the given configuration.
+
+### `nz.add_by_words(data: str)`
+Chunks the provided string dataset by word count and stores the embeddings internally. Must be called before querying.
+
+- `data` — Your full dataset as a **string**. Convert lists, dicts, DataFrames, or any other structure to a string first.
+
+---
+
+## Notes
+
+- Always pass data as a plain string. Use `str()`, `json.dumps()`, or `.to_string()` (for DataFrames) to convert before calling `add_by_words()`.
+- A lower `threshold` returns more (but less precise) chunks. A higher threshold is stricter.
+- Chunk `overlap` helps avoid losing context that falls on chunk boundaries — tune it based on your data's structure.
+
+---
+
+## Contributing
+
+Contributions are welcome! If you spot an area for improvement — whether it's a new chunking strategy, a better embedding backend, performance optimizations, or additional retrieval methods — feel free to open an issue or submit a pull request.
+
+Please make sure any changes are well-tested and clearly documented.
+
+---
+
+## License
+
+MIT
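The README's Notes ask for a plain string before `add_by_words()` is called. The sketch below shows one way that conversion might look for common Python structures. `to_plain_string` and the sample `records` are hypothetical and not part of nzachapy; the only thing taken from the package is that the input is later split on whitespace.

```python
import json


def to_plain_string(dataset):
    """Hypothetical helper (not part of nzachapy): coerce a dataset to a string."""
    if isinstance(dataset, str):
        return dataset
    if isinstance(dataset, (dict, list, tuple)):
        # json.dumps keeps keys and values readable once the text is chunked.
        return json.dumps(dataset, ensure_ascii=False)
    # Fallback for anything else, e.g. numbers or custom objects.
    return str(dataset)


# Illustrative data only.
records = [
    {"city": "Lagos", "note": "solar adoption rising"},
    {"city": "Oslo", "note": "hydro dominates"},
]
text = to_plain_string(records)

# The package splits on whitespace, so this is the word stream it would see.
words = text.split()
```

Because chunking is word-based, JSON punctuation stays attached to neighbouring words (for example `"Lagos",`). Whether that matters depends on the embedding model, so flattening records into sentences may be worth trying for some datasets.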
nzachapy-0.1.0/nzachapy/core.py
ADDED
@@ -0,0 +1,57 @@
+from .sementic import embed, match_embeddings
+
+class Nzacha:
+    def __init__(self,
+                 chunk_size = 200,
+                 overlap = 20,
+                 openai_api_key: str = None ,
+                 threshold = 0.6):
+        self.chunk_size = chunk_size
+        self.overlap = overlap
+        self.openai_api_key =openai_api_key
+        self.threshold = threshold
+
+        self.text_chunks = []
+        self.embeddings = []
+
+    def add_by_words(self, data: str):
+        data = data.split()
+
+        if self.overlap >= self.chunk_size:
+            raise ValueError("Overlap must be smaller than chunk size")
+        start = 0
+        while start < len(data):
+            end = start + self.chunk_size
+            chunk = data[start:end]
+
+            chunk_text = " ".join(chunk)
+
+            self.text_chunks.append(chunk_text)
+
+            embedding = embed(chunk_text, self.openai_api_key)
+            self.embeddings.append(embedding)
+
+            start += (self.chunk_size - self.overlap)
+
+        return self.text_chunks
+
+    def search(self, query: str):
+        query_embedding = embed(query, self.openai_api_key)
+
+        related = match_embeddings(self.embeddings, query_embedding, self.threshold)
+
+        related_data = []
+
+        for item in related:
+            index = item["index"]
+            related_data.append(self.text_chunks[index])
+
+
+        if related_data == []:
+            return "No strong semantic matches found. Consider lowering threshold."
+        else:
+            return " ".join(related_data)
+
+
+
+
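The chunking loop in `add_by_words` can be tried without loading any embedding model. The following is a minimal sketch of the same sliding-window logic as a standalone function. `chunk_by_words` is a hypothetical name, not something the package exports, and the embedding step is left out.

```python
def chunk_by_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping word chunks, mirroring the loop in add_by_words."""
    if overlap >= chunk_size:
        raise ValueError("Overlap must be smaller than chunk size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]


# Ten placeholder words, windows of 4 words sharing 1 word with the previous one.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

In this example the final chunk holds only `w9`, which the previous chunk already covers. The loop in `core.py` behaves the same way whenever the words left at the last window start number no more than `overlap`, so a short trailing chunk of repeated text is expected.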
nzachapy-0.1.0/nzachapy/sementic.py
ADDED
@@ -0,0 +1,38 @@
+from sentence_transformers import SentenceTransformer, util
+import httpx
+from openai import OpenAI
+import torch
+from functools import lru_cache
+
+model = SentenceTransformer('all-MiniLM-L6-v2')
+
+@lru_cache(maxsize=5000)
+def embed(data, openai_api_key=None):
+    if openai_api_key:
+        client = OpenAI(api_key=openai_api_key, http_client=httpx.Client(timeout=None))
+
+        response = client.embeddings.create(
+            model='text-embedding-3-small',
+            input=data
+        )
+        return response.data[0].embedding
+
+
+    data_embedding = model.encode(data, convert_to_tensor=True).tolist()
+
+    return data_embedding
+
+def match_embeddings(data_embedding_list, query_embedding, threshold):
+    data_embeddings = torch.stack([torch.tensor(c) for c in data_embedding_list])
+
+    similarities = util.cos_sim(query_embedding, data_embeddings)[0]
+
+    related = [
+        {"index": i, "score": float(similarities[i])}
+        for i in range(len(data_embedding_list))
+        if similarities[i] >= threshold
+    ]
+
+    related.sort(key=lambda x: x["score"], reverse=True)
+
+    return related
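The filtering step in `match_embeddings` is a cosine-similarity comparison followed by a threshold and a sort. The sketch below reproduces that logic in plain Python so it can be checked by hand. It deliberately swaps `torch` and `util.cos_sim` for a hand-written `cosine` function, and the two-dimensional vectors are made up for illustration; real embeddings from the models named above have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match(embeddings, query_embedding, threshold):
    """Keep entries scoring at or above threshold, best score first."""
    related = [
        {"index": i, "score": cosine(query_embedding, emb)}
        for i, emb in enumerate(embeddings)
    ]
    related = [r for r in related if r["score"] >= threshold]
    related.sort(key=lambda r: r["score"], reverse=True)
    return related


# Toy 2-D "embeddings"; the scores against the query are roughly 1.0, 0.6 and 0.0.
chunk_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
hits = match(chunk_vectors, [1.0, 0.0], threshold=0.5)
print([h["index"] for h in hits])  # [0, 1]
```

Since the result is sorted by score, `search` in `core.py` joins the matching chunks in relevance order rather than in their original document order, and words in the overlap between adjacent chunks can appear twice in the joined text.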
nzachapy-0.1.0/nzachapy.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,124 @@
(content identical to nzachapy-0.1.0/PKG-INFO)
nzachapy-0.1.0/nzachapy.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,12 @@
+LICENSE
+README.md
+pyproject.toml
+nzachapy/__init__.py
+nzachapy/core.py
+nzachapy/sementic.py
+nzachapy.egg-info/PKG-INFO
+nzachapy.egg-info/SOURCES.txt
+nzachapy.egg-info/dependency_links.txt
+nzachapy.egg-info/requires.txt
+nzachapy.egg-info/top_level.txt
+tests/test.py
nzachapy-0.1.0/nzachapy.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@
+
nzachapy-0.1.0/nzachapy.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
+nzachapy
nzachapy-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,23 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "nzachapy"
+version = "0.1.0"
+description = "Context filtering module that scans your data and returns only relevant portions based on a query."
+readme = "README.md"
+requires-python = ">=3.8"
+authors = [
+    { name = "Leonard Nwosu" }
+]
+
+dependencies = [
+    "numpy",
+    "sentence-transformers",
+    "openai",
+    "httpx"
+]
+
+[project.urls]
+Homepage = "https://github.com/Dblvcksheep/nzachapy"
nzachapy-0.1.0/setup.cfg
ADDED
nzachapy-0.1.0/tests/test.py
ADDED
@@ -0,0 +1,58 @@
+from nzachapy import Nzacha
+
+
+
+
+data = '''The future without oil! For optimists, a pleasant picture: let’s call it Picture One. Shall we imagine it?
+
+There we are, driving around in our cars fueled by hydrogen, or methane, or solar, or something else we have yet to dream up. Goods from afar come to us by solar-and-sail-driven ship — the sails computerized to catch every whiff of air — or else by new versions of the airship, which can lift and carry a huge amount of freight with minimal pollution and no ear-slitting noise. Trains have made a comeback. So have bicycles, when it isn’t snowing; but maybe there won’t be any more winter.
+
+
+We’ve gone back to small-scale hydropower, using fish-friendly dams. We’re eating locally, and even growing organic vegetables on our erstwhile front lawns, watering them with greywater and rainwater, and with the water saved from using low-flush toilets, showers instead of baths, water-saving washing machines, and other appliances already on the market. We’re using low-draw lightbulbs — incandescents have been banned — and energy-efficient heating systems, including pellet stoves, radiant panels, and long underwear. Heat yourself, not the room is no longer a slogan for nutty eccentrics: it’s the way we all live now.
+
+
+
+Due to improved insulation and indoor-climate-enhancing practices, including heatproof blinds and awnings, air-conditioning systems are obsolete, so they no longer suck up huge amounts of power every summer. As for power, in addition to hydro, solar, geothermal, wave, and wind generation, and emissions-free coal plants, we’re using almost foolproof nuclear power. Even when there are accidents it isn’t all bad news, because instant wildlife refuges are created as Nature invades those high-radiation zones where Man now fears to tread. There’s said to be some remarkable wildlife and botany in the area surrounding Chernobyl.
+
+
+What will we wear? A lot of hemp clothing, I expect: hemp is a hardy fiber source with few pesticide requirements, and cotton will have proven too costly and destructive to grow. We might also be wearing a lot of recycled tinfoil — keeps the heat in — and garments made from the recycled plastic we’ve harvested from the island of it twice the size of Texas currently floating around in the Pacific Ocean. What will we eat, besides our front-lawn vegetables? That may be a problem — we’re coming to the end of cheap fish, and there are other shortages looming. Abundant animal protein in large hunks may have had its day. However, we’re an inventive species, and when push comes to shove we don’t have a lot of fastidiousness: being omnivores, we’ll eat anything as long as there’s ketchup. Looking on the bright side: obesity due to over-eating will no longer be a crisis, and diet plans will not only be free, but mandatory.
+
+That’s Picture One. I like it. It’s comforting. Under certain conditions, it might even come true. Sort of. More or less.
+
+
+Then there’s Picture Two. Suppose the future without oil arrives very quickly. Suppose a bad fairy waves his wand, and poof! Suddenly there’s no oil, anywhere, at all.
+
+Everything would immediately come to a halt. No cars, no planes; a few trains still running on hydroelectric, and some bicycles, but that wouldn’t take very many people very far. Food would cease to flow into the cities, water would cease to flow out of the taps. Within hours, panic would set in.
+
+
+The first result would be the disappearance of the word “we”: except in areas with exceptional organization and leadership, the word “I” would replace it, as the war of all against all sets in. There would be a run on the supermarkets, followed immediately by food riots and looting. There would also be a run on the banks — people would want their money out for black market purchasing, although all currencies would quickly lose value, replaced by bartering. In any case the banks would close: their electronic systems would shut down, and they’d run out of cash.
+
+
+Having looted and hoarded some food and filled their bathtubs with water, people would hunker down in their houses, creeping out into the backyards if they dared because their toilets would no longer flush. The lights would go out. Communication systems would break down. What next? Open a can of dog food, eat it, then eat the dog, then wait for the authorities to restore order. But the authorities — lacking transport — would be unable to do this.
+
+
+Other authorities would take over. These would at first be known as thugs and street gangs, then as warlords. They’d attack the barricaded houses, raping, pillaging and murdering. But soon even they would run out of stolen food. It wouldn’t take long — given starvation, festering garbage, multiplying rats, and putrefying corpses — for pandemic disease to break out. It will quickly become apparent that the present world population of six and a half billion people is not only dependent on oil, but was created by it: humanity has expanded to fill the space made possible to it by oil, and without that oil it would shrink with astounding rapidity. As for the costs to “the economy,” there won’t be any “economy.” Money will vanish: the only items of exchange will be food, water, and most likely — before everyone topples over — sex.
+
+
+Picture Two is extreme, and also unlikely, but it exposes the truth: we’re hooked on oil, and without it we can’t do much of anything. And since it’s bound to run out eventually, and since cheap oil is already a thing of the past, we ought to be investing a lot of time, effort, and money in ways to replace it.
+
+Unfortunately, like every other species on the planet, we’re conservative: we don’t change our ways unless necessity forces us. The early lungfish didn’t develop lungs because it wanted to be a land animal, but because it wanted to remain a fish even as the dry season drew down the water around it. We’re also self-interested: unless there are laws mandating conservation of energy, most won’t do it, because why make sacrifices if others don’t? The absence of fair and enforceable energy-use rules penalizes the conscientious while enriching the amoral. In business, the laws of competition mean that most corporations will extract maximum riches from available resources with not much thought to the consequences. Why expect any human being or institution to behave otherwise unless they can see clear benefits?
+
+
+In addition to Pictures One and Two, there’s Picture Three. In Picture Three, some countries plan for the future of diminished oil, some don’t. Those planning now include — not strangely — those that don’t have any, or don’t need any. Iceland generates over half its power from abundant geothermal sources: it will not suffer much from an oil dearth. Germany is rapidly converting, as are a number of other oil-poor European countries. They are preparing to weather the coming storm.
+
+
+Then there are the oil-rich countries. Of these, those who were poor in the past, who got rich quick, and who have no resources other than oil are investing the oil wealth they know to be temporary in technologies they hope will work for them when the oil runs out. But in countries that have oil, but that have other resources too, such foresight is lacking. It does exist in one form: as a Pentagon report of 2003 called “An Abrupt Climate Change Scenario and its Implications for United States National Security” put it, “Nations with the resources to do so may build virtual fortresses around their countries, preserving resources for themselves.” That’s already happening: the walls grow higher and stronger every day.
+
+
+But the long-term government planning needed to deal with diminishing oil within rich, mixed-resource countries is mostly lacking. Biofuel is largely delusional: the amount of oil required to make it is larger than the payout. Some oil companies are exploring the development of other energy sources, but by and large they’re simply lobbying against anything and anyone that might cause a decrease in consumption and thus impact on their profits. It’s gold-rush time, and oil is the gold, and short-term gain outweighs long-term pain, and madness is afoot, and anyone who wants to stop the rush is deemed an enemy.
+
+My own country, Canada, is an oil-rich country. A lot of the oil is in the Athabasca oil sands, where licenses to mine oil are sold to anyone with the cash, and where CO2 is being poured into the atmosphere, not only from the oil used as an end product, but also in the course of its manufacture. Also used in its manufacture is an enormous amount of water. The water mostly comes from the Athabasca River, which is fed by a glacier. But due to global warming, glaciers are melting fast. When they’re gone, no more water, and thus no more oil from oil sands. Maybe we’ll be saved — partially — by our own ineptness. But we’ll leave much destruction in our wake.
+
+'''
+query = "effect of cars on climate change"
+nz = Nzacha(threshold=0.3)
+
+nz.add_by_words(data)
+
+print(nz.search(query))
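The test prints whatever `search` returns. According to `core.py` that is always a string: either the joined chunks or a fixed message when nothing clears the threshold. A caller that forwards the result to a language model may want to tell the two apart. The sketch below is one possible way to do that. `context_or_none` is a hypothetical helper, not part of the package, and it assumes the message text stays exactly as it appears in this release.

```python
# Message copied from core.py in this release; an exact-match check like this
# would need updating if a later version changes the wording.
NO_MATCH_MESSAGE = "No strong semantic matches found. Consider lowering threshold."


def context_or_none(result):
    """Hypothetical helper: map the no-match message from search() to None."""
    return None if result == NO_MATCH_MESSAGE else result


# Stand-in strings; in real use `result` would come from nz.search(query).
print(context_or_none(NO_MATCH_MESSAGE))          # None
print(context_or_none("Trains have made a comeback."))
```

Returning `None` lets the caller skip the model call or retry with a lower `threshold` instead of passing the message along as if it were context.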