PyPI - fast-csv-loader - Versions diffs - 2.1.0__tar.gz → 2.2.1__tar.gz - Mend

fast-csv-loader 2.1.0tar.gz → 2.2.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: fast-csv-loader
-Version: 2.1.0
+Version: 2.2.1
 Summary: A fast and memory efficient way to load large CSV files (Timeseries data) into Pandas
 Project-URL: Bug Tracker, https://github.com/BennyThadikaran/fast_csv_loader/issues
 Project-URL: Homepage, https://github.com/BennyThadikaran/fast_csv_loader
@@ -43,6 +43,11 @@ It also improves program execution time, when iterating or loading a large numbe
 **Supports Python >= 3.8**
+> **Note (v2.2.0):** This release introduces `cached_csv_loader`, an optional drop-in caching layer for `csv_loader` that significantly improves performance for repeated file reads. Existing behavior remains unchanged. Users are encouraged to review the updated documentation for details on cache behavior, invalidation, and configuration options.
+>
+> This feature was contributed by GitHub user **sai2311-eng**.
 ## Install
 `pip install fast-csv-loader`
@@ -51,6 +56,45 @@ It also improves program execution time, when iterating or loading a large numbe
 [https://bennythadikaran.github.io/fast_csv_loader/](https://bennythadikaran.github.io/fast_csv_loader/)
+## Cached Loader (mtime-aware)
+For workloads where the same files are read repeatedly — scanners looping
+over symbol CSVs, dashboards re-rendering, rolling backtests — use
+`cached_csv_loader`. It wraps `csv_loader` with an in-memory cache that
+automatically invalidates when the file's modification time changes.
+```python
+from fast_csv_loader import cached_csv_loader, cache_stats, invalidate_all
+from pathlib import Path
+# First call: reads from disk
+df = cached_csv_loader(Path("AAPL.csv"), period=200)
+# Subsequent calls on same file: served from cache (O(1))
+df = cached_csv_loader(Path("AAPL.csv"), period=200)
+# After your EOD job writes new data, the next call auto-invalidates
+# (mtime changed on disk). For explicit control:
+from fast_csv_loader import invalidate
+invalidate("AAPL.csv")     # drop one file
+invalidate_all()           # drop everything
+# Observability
+print(cache_stats())
+# {'hits': 49, 'misses': 1, 'evictions': 0, 'size': 1, 'hit_rate': 98.0, 'max_size': 500}
+```
+Benchmark on 133 small daily CSVs (~12 KB each), 5 repeat passes:
+```
+csv_loader (no cache):         ~555 ms
+cached_csv_loader (warm):       ~13 ms    (~43x faster)
+```
+The cache is process-local and thread-safe. Entries are evicted in
+insertion order when the cache exceeds `max_size` (default 500). Adjust
+with `set_max_cache_size(n)`.
 ## Performance
 Loading a portion of a large file is significantly faster than loading the entire file in memory.
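
The mtime-based invalidation described in the README hunk above can be illustrated with a minimal, pandas-free sketch. `cached_read_lines` and `demo` are hypothetical names and are not part of fast-csv-loader. The sketch caches raw text lines rather than a DataFrame, and it forces a new modification time with `os.utime` so the result does not depend on filesystem timestamp resolution.

```python
import os
import tempfile
import threading
from pathlib import Path

_cache = {}  # resolved path -> (mtime, lines)
_lock = threading.Lock()
_stats = {"hits": 0, "misses": 0}


def cached_read_lines(path: Path) -> list:
    """Return the file's lines, re-reading only when its mtime changes."""
    key = str(path.resolve())
    mtime = path.stat().st_mtime
    with _lock:
        entry = _cache.get(key)
        if entry is not None and entry[0] == mtime:
            _stats["hits"] += 1
            return list(entry[1])  # copy so callers cannot mutate the cache
    lines = path.read_text().splitlines()
    with _lock:
        _stats["misses"] += 1
        _cache[key] = (mtime, lines)
    return list(lines)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "demo.csv"
        csv_path.write_text("Date,Close\n2024-01-01,10\n")
        cached_read_lines(csv_path)  # miss: first read
        cached_read_lines(csv_path)  # hit: mtime unchanged
        csv_path.write_text("Date,Close\n2024-01-01,10\n2024-01-02,11\n")
        st = csv_path.stat()
        # Force a distinct mtime (assumption: simulates a later EOD write).
        os.utime(csv_path, (st.st_atime, st.st_mtime + 10))
        rows = cached_read_lines(csv_path)  # miss: mtime changed
        return {"rows": len(rows), **_stats}


result = demo()
print(result)
```

The lock is released while the file is read, so two threads that miss at the same time may both read the file; the second write simply overwrites the first.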

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/README.md RENAMED

@@ -14,6 +14,11 @@ It also improves program execution time, when iterating or loading a large numbe
 **Supports Python >= 3.8**
+> **Note (v2.2.0):** This release introduces `cached_csv_loader`, an optional drop-in caching layer for `csv_loader` that significantly improves performance for repeated file reads. Existing behavior remains unchanged. Users are encouraged to review the updated documentation for details on cache behavior, invalidation, and configuration options.
+>
+> This feature was contributed by GitHub user **sai2311-eng**.
 ## Install
 `pip install fast-csv-loader`
@@ -22,6 +27,45 @@ It also improves program execution time, when iterating or loading a large numbe
 [https://bennythadikaran.github.io/fast_csv_loader/](https://bennythadikaran.github.io/fast_csv_loader/)
+## Cached Loader (mtime-aware)
+For workloads where the same files are read repeatedly — scanners looping
+over symbol CSVs, dashboards re-rendering, rolling backtests — use
+`cached_csv_loader`. It wraps `csv_loader` with an in-memory cache that
+automatically invalidates when the file's modification time changes.
+```python
+from fast_csv_loader import cached_csv_loader, cache_stats, invalidate_all
+from pathlib import Path
+# First call: reads from disk
+df = cached_csv_loader(Path("AAPL.csv"), period=200)
+# Subsequent calls on same file: served from cache (O(1))
+df = cached_csv_loader(Path("AAPL.csv"), period=200)
+# After your EOD job writes new data, the next call auto-invalidates
+# (mtime changed on disk). For explicit control:
+from fast_csv_loader import invalidate
+invalidate("AAPL.csv")     # drop one file
+invalidate_all()           # drop everything
+# Observability
+print(cache_stats())
+# {'hits': 49, 'misses': 1, 'evictions': 0, 'size': 1, 'hit_rate': 98.0, 'max_size': 500}
+```
+Benchmark on 133 small daily CSVs (~12 KB each), 5 repeat passes:
+```
+csv_loader (no cache):         ~555 ms
+cached_csv_loader (warm):       ~13 ms    (~43x faster)
+```
+The cache is process-local and thread-safe. Entries are evicted in
+insertion order when the cache exceeds `max_size` (default 500). Adjust
+with `set_max_cache_size(n)`.
 ## Performance
 Loading a portion of a large file is significantly faster than loading the entire file in memory.
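
The eviction rule mentioned in the README hunk above (insertion order, triggered when the cache exceeds `max_size`) can be sketched on a plain `dict`. The batch size of one fifth of the limit follows `_evict_if_full` in `cached_loader.py` in this diff; `evict_oldest` is a hypothetical standalone name used only for illustration.

```python
def evict_oldest(cache: dict, max_entries: int) -> int:
    """Drop the oldest entries (by insertion order) once the cache is over the limit."""
    if len(cache) <= max_entries:
        return 0
    # Evict a batch of roughly 20% of the limit, and at least one entry.
    drop_count = max(1, max_entries // 5)
    for key in list(cache)[:drop_count]:
        del cache[key]
    return drop_count


# Eleven entries against a limit of ten: the two oldest keys are dropped.
cache = {f"file{i}.csv": i for i in range(11)}
evicted = evict_oldest(cache, max_entries=10)
print(evicted, len(cache), next(iter(cache)))
```

Because a Python `dict` keeps insertion order and a lookup does not reorder keys, this behaves as first-in, first-out rather than least-recently-used: a frequently hit entry is still evicted once it becomes the oldest.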

fast_csv_loader-2.2.1/fast_csv_loader/__init__.py ADDED

@@ -0,0 +1,17 @@
+from fast_csv_loader.csv_loader import csv_loader
+from fast_csv_loader.cached_loader import (
+    cached_csv_loader,
+    invalidate,
+    invalidate_all,
+    cache_stats,
+    set_max_cache_size,
+)
+__all__ = [
+    "csv_loader",
+    "cached_csv_loader",
+    "invalidate",
+    "invalidate_all",
+    "cache_stats",
+    "set_max_cache_size",
+]

fast_csv_loader-2.2.1/fast_csv_loader/cached_loader.py ADDED

@@ -0,0 +1,249 @@
+"""
+cached_loader — mtime-aware in-memory caching wrapper around csv_loader.
+When the same CSV file is read repeatedly (e.g. in a scanner loop, a
+rolling backtest, or a dashboard re-render), re-parsing from disk every
+time is wasteful. This module caches the parsed DataFrame and invalidates
+it automatically when the file's modification time changes.
+Typical use case: a trading scanner loops 50–200 stock CSVs every
+few minutes, the files only change once a day after EOD sync. Without
+caching, every loop re-parses every file.
+Benchmark on 133 small daily CSVs (~12 KB each), 5 repeat passes:
+    csv_loader (no cache):         ~555 ms
+    cached_csv_loader (warm):       ~13 ms    (~43x faster)
+Usage:
+    from fast_csv_loader import cached_csv_loader
+    df = cached_csv_loader(Path("AAPL.csv"), period=200)   # first call: disk
+    df = cached_csv_loader(Path("AAPL.csv"), period=200)   # second call: cached
+    # After writing new data to the file, cache auto-invalidates on next read
+    # because the file mtime changed. For explicit invalidation:
+    from fast_csv_loader import invalidate, invalidate_all, cache_stats
+    invalidate("AAPL.csv")
+    invalidate_all()
+    stats = cache_stats()
+"""
+from __future__ import annotations
+import threading
+from datetime import datetime
+from pathlib import Path
+from typing import Optional
+from collections.abc import Sequence
+import pandas as pd
+from fast_csv_loader.csv_loader import csv_loader
+# Internal cache: absolute_path_str -> (mtime, dataframe)
+_cache: dict = {}
+_cache_lock = threading.Lock()
+_stats = {"hits": 0, "misses": 0, "evictions": 0}
+# Cap cache size to avoid unbounded memory growth in long-running processes.
+# Entries evicted in insertion order (rough LRU — cheaper than a true LRU).
+_MAX_CACHE_ENTRIES = 500
+def _evict_if_full() -> None:
+    if len(_cache) > _MAX_CACHE_ENTRIES:
+        drop_count = max(1, _MAX_CACHE_ENTRIES // 5)
+        for k in list(_cache.keys())[:drop_count]:
+            _cache.pop(k, None)
+            _stats["evictions"] += 1
+def cached_csv_loader(
+    file_path: Path,
+    period: int = 160,
+    end_date: Optional[datetime] = None,
+    date_format: Optional[str] = None,
+    use_columns: Optional[Sequence[str]] = None,
+    chunk_size: int = 1024 * 6,
+) -> pd.DataFrame:
+    """
+    .. versionadded:: 2.2.0
+    Mtime-aware cached wrapper around ``csv_loader``.
+    Provides a drop-in replacement for ``csv_loader`` for cases where the
+    same CSV file may be read multiple times within the same process. Results
+    are cached based on file path, modification time, and selected query
+    parameters.
+    The cache key is composed of (file_path, end_date, date_format,
+    use_columns). The ``period`` parameter is NOT part of the cache key and
+    is applied after cache retrieval, meaning different ``period`` values
+    reuse the same cached DataFrame and only affect the returned slice.
+    If the underlying file has changed (based on mtime), the cache entry is
+    invalidated and the file is reloaded.
+    :param file_path: The path to the CSV file to be loaded.
+    :type file_path: pathlib.Path
+    :param period: Number of rows/candles to return from the end of the
+        dataset. Default is 160.
+    :type period: int
+    :param end_date: Load data up to this timestamp. If None, the most
+        recent data is used. If provided, loading is anchored to this date.
+    :type end_date: Optional[datetime]
+    :param date_format: Custom datetime format string used for parsing the
+        CSV date column if automatic parsing fails.
+    :type date_format: Optional[str]
+    :param use_columns: Default None. A sequence (e.g., list or tuple) of column names to load
+        from the CSV file. If None, all columns are loaded.
+    :type use_columns: Optional[Sequence[str]]
+    :param chunk_size: Size of chunks (in bytes) used when reading the CSV
+        file. Default is 6144 bytes (6 KB).
+    :type chunk_size: int
+    :return: A DataFrame containing the requested slice of timeseries data.
+    :rtype: pd.DataFrame
+    :raise FileNotFoundError: If ``file_path`` does not exist.
+    """
+    if not file_path.exists():
+        raise FileNotFoundError(f"No such file or directory: '{file_path}'")
+    try:
+        mtime = file_path.stat().st_mtime
+    except OSError:
+        return pd.DataFrame()
+    use_columns_key = tuple(use_columns) if use_columns else None
+    cache_key = (str(file_path.resolve()), end_date, date_format, use_columns_key)
+    with _cache_lock:
+        entry = _cache.get(cache_key)
+        if entry and entry[0] == mtime:
+            _stats["hits"] += 1
+            df = entry[1]
+            return (
+                df.iloc[-period:].copy() if period and len(df) > period else df.copy()
+            )
+    # Cache miss — load with enough history that later calls with larger
+    # `period` values can still be served from cache. We load a generous
+    # buffer by passing period * 4 (min 1000) to the underlying loader.
+    load_period = max(period * 4, 1000) if period else 10_000
+    with _cache_lock:
+        _stats["misses"] += 1
+    df = csv_loader(
+        file_path,
+        period=load_period,
+        end_date=end_date,
+        date_format=date_format,
+        use_columns=use_columns,
+        chunk_size=chunk_size,
+    )
+    with _cache_lock:
+        _cache[cache_key] = (mtime, df)
+        _evict_if_full()
+    return df.iloc[-period:].copy() if period and len(df) > period else df.copy()
+def invalidate(file_path) -> int:
+    """
+    .. versionadded:: 2.2.0
+    Drop all cache entries for a given file (any ``end_date`` / columns combination).
+    Useful after writing new data to disk when you want to ensure subsequent
+    reads do not return stale cached results. Otherwise, cache entries are
+    invalidated automatically based on file modification time.
+    :param file_path: Path of the file whose cache entries should be removed.
+    :type file_path: pathlib.Path | str
+    :return: Number of cache entries removed for the given file.
+    :rtype: int
+    """
+    target = str(Path(file_path).resolve())
+    with _cache_lock:
+        keys = [k for k in _cache if k[0] == target]
+        for k in keys:
+            _cache.pop(k, None)
+        return len(keys)
+def invalidate_all() -> int:
+    """
+    .. versionadded:: 2.2.0
+    Drop all entries from the cache.
+    Useful for resetting cache state entirely, for example during testing or
+    after bulk data updates.
+    :return: Number of cache entries removed.
+    :rtype: int
+    """
+    with _cache_lock:
+        n = len(_cache)
+        _cache.clear()
+        return n
+def cache_stats() -> dict:
+    """
+    .. versionadded:: 2.2.0
+    Return cache observability metrics including hit/miss counts, current
+    cache size, and hit rate.
+    :return: Dictionary containing cache statistics:
+        - ``hits``: Number of cache hits
+        - ``misses``: Number of cache misses
+        - ``evictions``: Number of evicted entries
+        - ``size``: Current number of cached entries
+        - ``hit_rate``: Cache hit rate as a percentage (rounded to 1 decimal)
+        - ``max_size``: Maximum allowed cache size
+    :rtype: dict
+    """
+    with _cache_lock:
+        total = _stats["hits"] + _stats["misses"]
+        hit_rate = (_stats["hits"] / total * 100) if total else 0.0
+        return {
+            "hits": _stats["hits"],
+            "misses": _stats["misses"],
+            "evictions": _stats["evictions"],
+            "size": len(_cache),
+            "hit_rate": round(hit_rate, 1),
+            "max_size": _MAX_CACHE_ENTRIES,
+        }
+def set_max_cache_size(n: int) -> None:
+    """
+    .. versionadded:: 2.2.0
+    Set the maximum number of cached entries allowed.
+    If the cache exceeds this size, older entries will be evicted
+    automatically.
+    :param n: New maximum cache size. Must be greater than 0.
+    :type n: int
+    :raise ValueError: If ``n`` is less than or equal to 0.
+    """
+    global _MAX_CACHE_ENTRIES
+    if n <= 0:
+        raise ValueError("max cache size must be > 0")
+    _MAX_CACHE_ENTRIES = int(n)
+    with _cache_lock:
+        _evict_if_full()
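
Since `period` is not part of the cache key, the buffer rule in `cached_csv_loader` decides how many rows a cached frame holds. The sketch below mirrors that rule and adds a hypothetical guard, `can_serve_from_cache`, which is not in the package. It assumes the underlying loader returns at most the requested number of rows, so a shorter result means the whole file is already cached.

```python
def load_period_for(period: int) -> int:
    """Mirror the buffer rule in the diff: period * 4 with a floor of 1000, or 10_000 when period is falsy."""
    return max(period * 4, 1000) if period else 10_000


def can_serve_from_cache(cached_rows: int, loaded_period: int, requested_period: int) -> bool:
    """Hypothetical guard: True when the cached frame can answer the request without truncation."""
    # Enough rows are cached to cover the requested slice ...
    covers_request = requested_period <= cached_rows
    # ... or the loader returned fewer rows than asked for, so nothing more exists on disk.
    whole_file_cached = cached_rows < loaded_period
    return covers_request or whole_file_cached


first_load = load_period_for(200)  # first call with period=200 loads this many rows
print(first_load)
print(can_serve_from_cache(cached_rows=1000, loaded_period=first_load, requested_period=5000))
```

Under these assumptions, a first call with `period=200` caches up to 1000 rows, and a later call with `period=5000` on the same key would be answered from those 1000 rows in the code shown in this diff. Calling `invalidate` before requesting a much larger period is one way to avoid that.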

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/fast_csv_loader/csv_loader.py RENAMED

@@ -2,8 +2,8 @@ import io
 import os
 from datetime import datetime
 from pathlib import Path
-from typing import Optional, List
+from typing import Optional
+from collections.abc import Sequence
 import pandas as pd
@@ -12,7 +12,7 @@ def csv_loader(
     period: int = 160,
     end_date: Optional[datetime] = None,
     date_format: Optional[str] = None,
-    use_columns: Optional[List[str]] = None,
+    use_columns: Optional[Sequence[str]] = None,
     chunk_size: int = 1024 * 6,
 ) -> pd.DataFrame:
     """
@@ -35,8 +35,9 @@ def csv_loader(
     :param date_format: Custom date format in case pandas is unable to parse the date column.
     :type date_format: Optional[str]
-    :param use_columns: Default None. List of column names to load from the CSV file. If None, all columns are loaded.
-    :type use_columns: Optional[List[str]]
+    :param use_columns: Default None. A sequence (e.g., list or tuple) of column names to load
+        from the CSV file. If None, all columns are loaded.
+    :type use_columns: Optional[Sequence[str]]
     :param chunk_size: The size of data chunks loaded into memory.
         The default is 6144 bytes (6 KB).
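
Widening `use_columns` from `List[str]` to `Sequence[str]` lets callers pass a tuple as well as a list. A short sketch of why that matters for the cached loader: a list is not hashable, so it has to be normalized to a tuple before it can be part of a dictionary key. `columns_key` is a hypothetical helper; the normalization itself follows the `use_columns_key` line in `cached_loader.py` in this diff.

```python
from collections.abc import Sequence
from typing import Optional, Tuple


# Note: subscripting collections.abc.Sequence at runtime requires Python 3.9+.
def columns_key(use_columns: Optional[Sequence[str]]) -> Optional[Tuple[str, ...]]:
    """Normalize a list or tuple of column names into a hashable tuple, or None."""
    return tuple(use_columns) if use_columns else None


as_list = columns_key(["Date", "Close"])
as_tuple = columns_key(("Date", "Close"))

# Both spellings map to the same key, so they share one cache entry.
cache = {("AAPL.csv", as_list): "cached frame placeholder"}
print(("AAPL.csv", as_tuple) in cache)
```

On Python 3.8, `collections.abc.Sequence[str]` can only appear in annotations when their evaluation is postponed, as `cached_loader.py` does with `from __future__ import annotations`. Whether `csv_loader.py` does the same is not visible in this hunk.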

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ requires = [ "hatchling" ]
 [project]
 name = "fast-csv-loader"
-version = "2.1.0"
+version = "2.2.1"
 description = "A fast and memory efficient way to load large CSV files (Timeseries data) into Pandas"
 readme = "README.md"
 keywords = [ "csv-loader", "csv-reader", "memory-efficient", "pandas-dataframe", "python3" ]
@@ -33,5 +33,5 @@ urls."Bug Tracker" = "https://github.com/BennyThadikaran/fast_csv_loader/issues"
 urls.Homepage = "https://github.com/BennyThadikaran/fast_csv_loader"
 [tool.hatch]
-build.targets.sdist.exclude = [ "docs", "tests", ".github" ]
-build.targets.wheel.exclude = [ "docs", "tests", ".github" ]
+build.targets.wheel.exclude = [ ".github", "docs", "tests" ]
+build.targets.sdist.exclude = [ ".github", "docs", "tests" ]

fast_csv_loader-2.2.1/requirements.txt ADDED

@@ -0,0 +1,4 @@
+pandas==2.0.3; python_version == "3.8"
+pandas==2.2.2; python_version > "3.8" and python_version < "3.13"
+pandas==2.2.3; python_version >= "3.13" and python_version < "3.14"
+pandas==2.3.3; python_version >= "3.14"
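
The new `requirements.txt` replaces the single range `pandas >= 2, <3` with one exact pin per interpreter version, selected by environment markers. As a rough illustration of how those four markers partition Python versions, the sketch below reproduces the selection in plain Python. `pinned_pandas` is a hypothetical helper, not something pip or the package provides, and it compares only major and minor versions, as the `python_version` marker does.

```python
import sys


def pinned_pandas(version_info=sys.version_info):
    """Return the pandas pin the markers above would select, or None if no line applies."""
    py = (version_info[0], version_info[1])
    if py == (3, 8):
        return "2.0.3"
    if (3, 8) < py < (3, 13):
        return "2.2.2"
    if (3, 13) <= py < (3, 14):
        return "2.2.3"
    if py >= (3, 14):
        return "2.3.3"
    return None  # below 3.8: none of the four lines matches


for version in [(3, 8), (3, 12), (3, 13), (3, 14)]:
    print(version, pinned_pandas(version))
```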

fast_csv_loader-2.1.0/fast_csv_loader/__init__.py DELETED

@@ -1 +0,0 @@
-from fast_csv_loader.csv_loader import csv_loader

fast_csv_loader-2.1.0/requirements.txt DELETED

@@ -1 +0,0 @@
-pandas >= 2, <3

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/.gitignore RENAMED

File without changes

{fast_csv_loader-2.1.0 → fast_csv_loader-2.2.1}/LICENSE RENAMED

File without changes
