ragit 0.7.3__tar.gz → 0.7.4__tar.gz

This diff shows the changes between the two package versions as they appear in their public registry and is provided for informational purposes only.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: ragit
- Version: 0.7.3
+ Version: 0.7.4
  Summary: Automatic RAG Pattern Optimization Engine
  Author: RODMENA LIMITED
  Maintainer-email: RODMENA LIMITED <info@rodmena.co.uk>
@@ -26,6 +26,8 @@ Requires-Dist: pydantic>=2.0.0
  Requires-Dist: python-dotenv>=1.0.0
  Requires-Dist: scikit-learn>=1.5.0
  Requires-Dist: tqdm>=4.66.0
+ Requires-Dist: trio>=0.24.0
+ Requires-Dist: httpx>=0.27.0
  Provides-Extra: dev
  Requires-Dist: ragit[test]; extra == "dev"
  Requires-Dist: pytest; extra == "dev"
@@ -443,6 +445,77 @@ print(f"Score: {best.score:.3f}")

  The experiment tests different combinations of chunk sizes, overlaps, and retrieval parameters to find what works best for your content.

+ ## Performance Features
+
+ Ragit includes several optimizations for production workloads:
+
+ ### Connection Pooling
+
+ `OllamaProvider` uses HTTP connection pooling via `requests.Session()` for faster sequential requests:
+
+ ```python
+ from ragit.providers import OllamaProvider
+
+ provider = OllamaProvider()
+
+ # All requests reuse the same connection pool
+ for text in texts:
+     provider.embed(text, model="mxbai-embed-large")
+
+ # Explicitly close when done (optional, auto-closes on garbage collection)
+ provider.close()
+ ```
+
+ ### Async Parallel Embedding
+
+ For large batches, use `embed_batch_async()` with trio for 5-10x faster embedding:
+
+ ```python
+ import trio
+ from ragit.providers import OllamaProvider
+
+ provider = OllamaProvider()
+
+ async def embed_documents():
+     texts = ["doc1...", "doc2...", "doc3...", ...]  # hundreds of texts
+     embeddings = await provider.embed_batch_async(
+         texts,
+         model="mxbai-embed-large",
+         max_concurrent=10  # Adjust based on server capacity
+     )
+     return embeddings
+
+ # Run with trio
+ results = trio.run(embed_documents)
+ ```
+
+ ### Embedding Cache
+
+ Repeated embedding calls are cached automatically in a 2048-entry LRU cache:
+
+ ```python
+ from ragit.providers import OllamaProvider
+
+ provider = OllamaProvider(use_cache=True)  # Default
+
+ # First call hits the API
+ provider.embed("Hello world", model="mxbai-embed-large")
+
+ # Second call returns the cached result instantly
+ provider.embed("Hello world", model="mxbai-embed-large")
+
+ # View cache statistics
+ print(OllamaProvider.embedding_cache_info())
+ # {'hits': 1, 'misses': 1, 'maxsize': 2048, 'currsize': 1}
+
+ # Clear the cache if needed
+ OllamaProvider.clear_embedding_cache()
+ ```
+
+ ### Pre-normalized Embeddings
+
+ Vector similarity uses pre-normalized embeddings, so cosine similarity reduces to a single dot product per comparison, with no extra normalization step at query time.
+
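The pre-normalization point is easiest to see in a tiny sketch. This is not ragit's internal code, just an illustration with NumPy of why normalizing vectors once at index time turns every later cosine comparison into a plain dot product:

```python
import numpy as np

def normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length once, at indexing time."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pretend these came back from provider.embed(...)
doc_vec = normalize(np.array([0.2, 0.9, 0.4]))
query_vec = normalize(np.array([0.1, 0.8, 0.5]))

# With both sides pre-normalized, cosine similarity is just a dot product.
cosine = float(np.dot(doc_vec, query_vec))
print(f"cosine similarity: {cosine:.3f}")
```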
  ## API Reference

  ### Document Loading
@@ -38,6 +38,8 @@ dependencies = [
      "python-dotenv>=1.0.0",
      "scikit-learn>=1.5.0",
      "tqdm>=4.66.0",
+     "trio>=0.24.0",
+     "httpx>=0.27.0",
  ]

  [project.urls]
@@ -7,9 +7,19 @@ Ollama provider for LLM and Embedding operations.

  This provider connects to a local or remote Ollama server.
  Configuration is loaded from environment variables.
+
+ Performance optimizations:
+ - Connection pooling via requests.Session()
+ - Async parallel embedding via trio + httpx
+ - LRU cache for repeated embedding queries
  """

+ from functools import lru_cache
+ from typing import Any
+
+ import httpx
  import requests
+ import trio

  from ragit.config import config
  from ragit.providers.base import (
@@ -20,10 +30,37 @@ from ragit.providers.base import (
  )


+ # Module-level cache for embeddings (shared across instances)
+ @lru_cache(maxsize=2048)
+ def _cached_embedding(text: str, model: str, embedding_url: str, timeout: int) -> tuple[float, ...]:
+     """Cache embedding results to avoid redundant API calls."""
+     # Truncate oversized inputs
+     if len(text) > OllamaProvider.MAX_EMBED_CHARS:
+         text = text[: OllamaProvider.MAX_EMBED_CHARS]
+
+     response = requests.post(
+         f"{embedding_url}/api/embeddings",
+         headers={"Content-Type": "application/json"},
+         json={"model": model, "prompt": text},
+         timeout=timeout,
+     )
+     response.raise_for_status()
+     data = response.json()
+     embedding = data.get("embedding", [])
+     if not embedding:
+         raise ValueError("Empty embedding returned from Ollama")
+     return tuple(embedding)
+
+
  class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
      """
      Ollama provider for both LLM and Embedding operations.

+     Performance features:
+     - Connection pooling via requests.Session() for faster sequential requests
+     - Async parallel embedding via embed_batch_async() using trio + httpx
+     - LRU cache for repeated embedding queries (2048 entries)
+
      Parameters
      ----------
      base_url : str, optional
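Because `_cached_embedding` is decorated with `functools.lru_cache`, entries are keyed on the full positional-argument tuple — text, model, embedding URL, and timeout — so the same text embedded with a different model is a separate cache entry. A toy standalone sketch of that behaviour (not ragit code; names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=4)
def fake_embed(text: str, model: str) -> tuple[float, ...]:
    print(f"computing embedding for {text!r} with {model}")
    return (float(len(text)),)  # stand-in for a real API call

fake_embed("hello", "mxbai-embed-large")   # computed
fake_embed("hello", "mxbai-embed-large")   # served from cache, no print
fake_embed("hello", "nomic-embed-text")    # different model -> computed again
print(fake_embed.cache_info())             # CacheInfo(hits=1, misses=2, maxsize=4, currsize=2)
```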
@@ -32,6 +69,8 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
          API key for authentication (default: from OLLAMA_API_KEY env var)
      timeout : int, optional
          Request timeout in seconds (default: from OLLAMA_TIMEOUT env var)
+     use_cache : bool, optional
+         Enable embedding cache (default: True)

      Examples
      --------
@@ -39,12 +78,12 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
      >>> response = provider.generate("What is RAG?", model="llama3")
      >>> print(response.text)

-     >>> embedding = provider.embed("Hello world", model="nomic-embed-text")
-     >>> print(len(embedding.embedding))
+     >>> # Async batch embedding (5-10x faster for large batches)
+     >>> embeddings = trio.run(provider.embed_batch_async, texts, "mxbai-embed-large")
      """

      # Known embedding model dimensions
-     EMBEDDING_DIMENSIONS = {
+     EMBEDDING_DIMENSIONS: dict[str, int] = {
          "nomic-embed-text": 768,
          "nomic-embed-text:latest": 768,
          "mxbai-embed-large": 1024,
@@ -57,7 +96,7 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
      }

      # Max characters per embedding request (safe limit for 512 token models)
-     MAX_EMBED_CHARS = 1500
+     MAX_EMBED_CHARS = 2000

      def __init__(
          self,
@@ -65,14 +104,39 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
          embedding_url: str | None = None,
          api_key: str | None = None,
          timeout: int | None = None,
+         use_cache: bool = True,
      ) -> None:
          self.base_url = (base_url or config.OLLAMA_BASE_URL).rstrip("/")
          self.embedding_url = (embedding_url or config.OLLAMA_EMBEDDING_URL).rstrip("/")
          self.api_key = api_key or config.OLLAMA_API_KEY
          self.timeout = timeout or config.OLLAMA_TIMEOUT
+         self.use_cache = use_cache
          self._current_embed_model: str | None = None
          self._current_dimensions: int = 768  # default

+         # Connection pooling via session
+         self._session: requests.Session | None = None
+
+     @property
+     def session(self) -> requests.Session:
+         """Lazy-initialized session for connection pooling."""
+         if self._session is None:
+             self._session = requests.Session()
+             self._session.headers.update({"Content-Type": "application/json"})
+             if self.api_key:
+                 self._session.headers.update({"Authorization": f"Bearer {self.api_key}"})
+         return self._session
+
+     def close(self) -> None:
+         """Close the session and release resources."""
+         if self._session is not None:
+             self._session.close()
+             self._session = None
+
+     def __del__(self) -> None:
+         """Cleanup on garbage collection."""
+         self.close()
+
      def _get_headers(self, include_auth: bool = True) -> dict[str, str]:
          """Get request headers including authentication if API key is set."""
          headers = {"Content-Type": "application/json"}
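Since the pooled `requests.Session` is now owned by the provider, long-running jobs can release its sockets deterministically instead of waiting for garbage collection. A minimal usage sketch (the model name and texts are illustrative):

```python
from ragit.providers import OllamaProvider

provider = OllamaProvider()
try:
    # Sequential calls reuse the same pooled connection
    vectors = [provider.embed(t, model="mxbai-embed-large") for t in ["a", "b", "c"]]
finally:
    provider.close()  # releases the pooled HTTP connections
```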
@@ -91,21 +155,19 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
      def is_available(self) -> bool:
          """Check if Ollama server is reachable."""
          try:
-             response = requests.get(
+             response = self.session.get(
                  f"{self.base_url}/api/tags",
-                 headers=self._get_headers(),
                  timeout=5,
              )
              return response.status_code == 200
          except requests.RequestException:
              return False

-     def list_models(self) -> list[dict[str, str]]:
+     def list_models(self) -> list[dict[str, Any]]:
          """List available models on the Ollama server."""
          try:
-             response = requests.get(
+             response = self.session.get(
                  f"{self.base_url}/api/tags",
-                 headers=self._get_headers(),
                  timeout=10,
              )
              response.raise_for_status()
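Both methods keep their existing signatures, so a quick health check before batch work reads the same as before; a small sketch:

```python
from ragit.providers import OllamaProvider

provider = OllamaProvider()
if provider.is_available():
    for model in provider.list_models():
        print(model)  # each entry is a dict as returned by Ollama's /api/tags
else:
    print("Ollama server is not reachable")
```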
@@ -138,9 +200,8 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
              payload["system"] = system_prompt

          try:
-             response = requests.post(
+             response = self.session.post(
                  f"{self.base_url}/api/generate",
-                 headers=self._get_headers(),
                  json=payload,
                  timeout=self.timeout,
              )
@@ -161,33 +222,34 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
              raise ConnectionError(f"Ollama generate failed: {e}") from e

      def embed(self, text: str, model: str) -> EmbeddingResponse:
-         """Generate embedding using Ollama (uses embedding_url, no auth for local)."""
+         """Generate embedding using Ollama with optional caching."""
          self._current_embed_model = model
          self._current_dimensions = self.EMBEDDING_DIMENSIONS.get(model, 768)

-         # Truncate oversized inputs to prevent context length errors
-         if len(text) > self.MAX_EMBED_CHARS:
-             text = text[: self.MAX_EMBED_CHARS]
-
          try:
-             response = requests.post(
-                 f"{self.embedding_url}/api/embeddings",
-                 headers=self._get_headers(include_auth=False),
-                 json={"model": model, "prompt": text},
-                 timeout=self.timeout,
-             )
-             response.raise_for_status()
-             data = response.json()
-
-             embedding = data.get("embedding", [])
-             if not embedding:
-                 raise ValueError("Empty embedding returned from Ollama")
+             if self.use_cache:
+                 # Use cached version
+                 embedding = _cached_embedding(text, model, self.embedding_url, self.timeout)
+             else:
+                 # Direct call without cache
+                 truncated = text[: self.MAX_EMBED_CHARS] if len(text) > self.MAX_EMBED_CHARS else text
+                 response = self.session.post(
+                     f"{self.embedding_url}/api/embeddings",
+                     json={"model": model, "prompt": truncated},
+                     timeout=self.timeout,
+                 )
+                 response.raise_for_status()
+                 data = response.json()
+                 embedding_list = data.get("embedding", [])
+                 if not embedding_list:
+                     raise ValueError("Empty embedding returned from Ollama")
+                 embedding = tuple(embedding_list)

              # Update dimensions from actual response
              self._current_dimensions = len(embedding)

              return EmbeddingResponse(
-                 embedding=tuple(embedding),
+                 embedding=embedding,
                  model=model,
                  provider=self.provider_name,
                  dimensions=len(embedding),
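Because the LRU cache lives at module level rather than on the instance, it is shared by every `OllamaProvider` that resolves the same embedding URL and timeout from configuration. A small sketch of that behaviour (the printed counts are illustrative and assume a fresh cache):

```python
from ragit.providers import OllamaProvider

a = OllamaProvider()
b = OllamaProvider()  # same OLLAMA_EMBEDDING_URL / OLLAMA_TIMEOUT from config

a.embed("same text", model="mxbai-embed-large")  # miss: calls the Ollama API
b.embed("same text", model="mxbai-embed-large")  # hit: served from the shared cache

print(OllamaProvider.embedding_cache_info())
# e.g. {'hits': 1, 'misses': 1, 'maxsize': 2048, 'currsize': 1}
```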
@@ -196,7 +258,9 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
              raise ConnectionError(f"Ollama embed failed: {e}") from e

      def embed_batch(self, texts: list[str], model: str) -> list[EmbeddingResponse]:
-         """Generate embeddings for multiple texts (uses embedding_url, no auth for local).
+         """Generate embeddings for multiple texts sequentially.
+
+         For better performance with large batches, use embed_batch_async().

          Note: Ollama /api/embeddings only supports single prompts, so we loop.
          """
@@ -206,26 +270,28 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
          results = []
          try:
              for text in texts:
-                 # Truncate oversized inputs to prevent context length errors
-                 if len(text) > self.MAX_EMBED_CHARS:
-                     text = text[: self.MAX_EMBED_CHARS]
-
-                 response = requests.post(
-                     f"{self.embedding_url}/api/embeddings",
-                     headers=self._get_headers(include_auth=False),
-                     json={"model": model, "prompt": text},
-                     timeout=self.timeout,
-                 )
-                 response.raise_for_status()
-                 data = response.json()
+                 # Truncate oversized inputs
+                 truncated = text[: self.MAX_EMBED_CHARS] if len(text) > self.MAX_EMBED_CHARS else text
+
+                 if self.use_cache:
+                     embedding = _cached_embedding(truncated, model, self.embedding_url, self.timeout)
+                 else:
+                     response = self.session.post(
+                         f"{self.embedding_url}/api/embeddings",
+                         json={"model": model, "prompt": truncated},
+                         timeout=self.timeout,
+                     )
+                     response.raise_for_status()
+                     data = response.json()
+                     embedding_list = data.get("embedding", [])
+                     embedding = tuple(embedding_list) if embedding_list else ()

-                 embedding = data.get("embedding", [])
                  if embedding:
                      self._current_dimensions = len(embedding)

                  results.append(
                      EmbeddingResponse(
-                         embedding=tuple(embedding),
+                         embedding=embedding,
                          model=model,
                          provider=self.provider_name,
                          dimensions=len(embedding),
@@ -235,6 +301,87 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
          except requests.RequestException as e:
              raise ConnectionError(f"Ollama batch embed failed: {e}") from e

+     async def embed_batch_async(
+         self,
+         texts: list[str],
+         model: str,
+         max_concurrent: int = 10,
+     ) -> list[EmbeddingResponse]:
+         """Generate embeddings for multiple texts in parallel using trio.
+
+         This method is 5-10x faster than embed_batch() for large batches
+         by making concurrent HTTP requests.
+
+         Parameters
+         ----------
+         texts : list[str]
+             Texts to embed.
+         model : str
+             Embedding model name.
+         max_concurrent : int
+             Maximum concurrent requests (default: 10).
+             Higher values = faster but more server load.
+
+         Returns
+         -------
+         list[EmbeddingResponse]
+             Embeddings in the same order as input texts.
+
+         Examples
+         --------
+         >>> import trio
+         >>> embeddings = trio.run(provider.embed_batch_async, texts, "mxbai-embed-large")
+         """
+         self._current_embed_model = model
+         self._current_dimensions = self.EMBEDDING_DIMENSIONS.get(model, 768)
+
+         # Results storage (index -> embedding)
+         results: dict[int, EmbeddingResponse] = {}
+         errors: list[Exception] = []
+
+         # Semaphore to limit concurrency
+         limiter = trio.CapacityLimiter(max_concurrent)
+
+         async def fetch_embedding(client: httpx.AsyncClient, index: int, text: str) -> None:
+             """Fetch a single embedding."""
+             async with limiter:
+                 try:
+                     # Truncate oversized inputs
+                     truncated = text[: self.MAX_EMBED_CHARS] if len(text) > self.MAX_EMBED_CHARS else text
+
+                     response = await client.post(
+                         f"{self.embedding_url}/api/embeddings",
+                         json={"model": model, "prompt": truncated},
+                         timeout=self.timeout,
+                     )
+                     response.raise_for_status()
+                     data = response.json()
+
+                     embedding_list = data.get("embedding", [])
+                     embedding = tuple(embedding_list) if embedding_list else ()
+
+                     if embedding:
+                         self._current_dimensions = len(embedding)
+
+                     results[index] = EmbeddingResponse(
+                         embedding=embedding,
+                         model=model,
+                         provider=self.provider_name,
+                         dimensions=len(embedding),
+                     )
+                 except Exception as e:
+                     errors.append(e)
+
+         async with httpx.AsyncClient() as client, trio.open_nursery() as nursery:
+             for i, text in enumerate(texts):
+                 nursery.start_soon(fetch_embedding, client, i, text)
+
+         if errors:
+             raise ConnectionError(f"Ollama async batch embed failed: {errors[0]}") from errors[0]
+
+         # Return results in original order
+         return [results[i] for i in range(len(texts))]
+
      def chat(
          self,
          messages: list[dict[str, str]],
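For corpora too large to submit in one call, `embed_batch_async()` can be driven in slices. A sketch under that assumption — `batch_size` and `corpus_texts` are illustrative names, not part of the ragit API:

```python
import trio
from ragit.providers import OllamaProvider

async def embed_corpus(texts: list[str], batch_size: int = 256):
    provider = OllamaProvider()
    embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Each slice is embedded with up to 10 concurrent requests
        embeddings.extend(
            await provider.embed_batch_async(chunk, "mxbai-embed-large", max_concurrent=10)
        )
    return embeddings

# vectors = trio.run(embed_corpus, corpus_texts)
```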
@@ -273,9 +420,8 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
          }

          try:
-             response = requests.post(
+             response = self.session.post(
                  f"{self.base_url}/api/chat",
-                 headers=self._get_headers(),
                  json=payload,
                  timeout=self.timeout,
              )
@@ -293,3 +439,23 @@ class OllamaProvider(BaseLLMProvider, BaseEmbeddingProvider):
              )
          except requests.RequestException as e:
              raise ConnectionError(f"Ollama chat failed: {e}") from e
+
+     @staticmethod
+     def clear_embedding_cache() -> None:
+         """Clear the embedding cache."""
+         _cached_embedding.cache_clear()
+
+     @staticmethod
+     def embedding_cache_info() -> dict[str, int]:
+         """Get embedding cache statistics."""
+         info = _cached_embedding.cache_info()
+         return {
+             "hits": info.hits,
+             "misses": info.misses,
+             "maxsize": info.maxsize or 0,
+             "currsize": info.currsize,
+         }
+
+
+ # Export the EMBEDDING_DIMENSIONS for external use
+ EMBEDDING_DIMENSIONS = OllamaProvider.EMBEDDING_DIMENSIONS
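The statistics returned by `embedding_cache_info()` make it easy to log a hit rate after a run; a small sketch using only the keys shown above:

```python
from ragit.providers import OllamaProvider

info = OllamaProvider.embedding_cache_info()
total = info["hits"] + info["misses"]
hit_rate = info["hits"] / total if total else 0.0
print(f"embedding cache: {hit_rate:.1%} hit rate, "
      f"{info['currsize']}/{info['maxsize']} entries used")
```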
@@ -2,4 +2,4 @@
  # Copyright RODMENA LIMITED 2025
  # SPDX-License-Identifier: Apache-2.0
  #
- __version__ = "0.7.3"
+ __version__ = "0.7.4"
@@ -5,6 +5,8 @@ pydantic>=2.0.0
  python-dotenv>=1.0.0
  scikit-learn>=1.5.0
  tqdm>=4.66.0
+ trio>=0.24.0
+ httpx>=0.27.0

  [dev]
  ragit[test]