PyPI - storetle - Versions diffs - 0.2.0__tar.gz → 0.2.2__tar.gz - Mend

storetle 0.2.0 (tar.gz) → 0.2.2 (tar.gz)


Files changed (24)

{storetle-0.2.0/storetle.egg-info → storetle-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.0
+Version: 0.2.2
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -96,6 +96,53 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
+## Remote archives (v0.2.1)
+`get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+of Range requests; fetching a document downloads only its ~2MB chunk — no
+server-side code, works against any Range-capable host (R2, S3, GitHub
+Pages, nginx):
+```bash
+storetle info https://adventurelands.github.io/storetle/sample.storetle
+storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+```
+```python
+from storetle import RemoteReader
+with RemoteReader('https://host/corpus.storetle') as r:
+    html = r[42]          # one ~2MB range request
+```
 ## Python API
 ```python

{storetle-0.2.0 → storetle-0.2.2}/README.md RENAMED

@@ -73,6 +73,53 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
+## Remote archives (v0.2.1)
+`get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+of Range requests; fetching a document downloads only its ~2MB chunk — no
+server-side code, works against any Range-capable host (R2, S3, GitHub
+Pages, nginx):
+```bash
+storetle info https://adventurelands.github.io/storetle/sample.storetle
+storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+```
+```python
+from storetle import RemoteReader
+with RemoteReader('https://host/corpus.storetle') as r:
+    html = r[42]          # one ~2MB range request
+```
 ## Python API
 ```python

{storetle-0.2.0 → storetle-0.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "storetle"
-version = "0.2.0"
+version = "0.2.2"
 description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
 readme = "README.md"
 requires-python = ">=3.8"

{storetle-0.2.0 → storetle-0.2.2}/storetle/__init__.py RENAMED

@@ -8,10 +8,11 @@
 #   benchmark      — compare storetle vs gzip on your own data
 from .stream import StreamWriter, StreamReader
+from .remote import RemoteReader
 from .folder import pack, unpack
-__version__ = '0.2.0'
-__all__ = ['StreamWriter', 'StreamReader', 'pack', 'unpack', 'benchmark']
+__version__ = '0.2.2'
+__all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']
 def benchmark(folder, quiet=False):

{storetle-0.2.0 → storetle-0.2.2}/storetle/cli.py RENAMED

@@ -54,19 +54,35 @@ def cmd_pack(args):
     print(f'Output: {output}')
+def _is_url(s):
+    return s.startswith('http://') or s.startswith('https://')
+def _open_reader(src):
+    """Open a local path with StreamReader or a URL with RemoteReader."""
+    if _is_url(src):
+        from .remote import RemoteReader
+        return RemoteReader(src)
+    from .stream import StreamReader
+    return StreamReader(src)
 def cmd_unpack(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle unpack <file.storetle> <output_folder>')
+        print('Usage: storetle unpack <file-or-url> <output_folder> [--text]')
         sys.exit(1)
-    from .stream import StreamReader
     src = args[0]
     dst = Path(args[1])
     dst.mkdir(parents=True, exist_ok=True)
-    with StreamReader(src) as r:
-        print(f'Extracting {r.doc_count} documents to {dst}/')
-        for i, doc in enumerate(r):
-            out = dst / f'doc_{i:06d}.html'
+    ext = 'txt' if text else 'html'
+    with _open_reader(src) as r:
+        print(f'Extracting {r.doc_count} documents to {dst}/ as .{ext}')
+        docs = r.iter_text() if text else iter(r)
+        for i, doc in enumerate(docs):
+            out = dst / f'doc_{i:06d}.{ext}'
             out.write_bytes(doc)
             if (i + 1) % 100 == 0:
                 print(f'  {i+1}/{r.doc_count}...')
@@ -75,15 +91,20 @@ def cmd_unpack(args):
 def cmd_info(args):
     if not args:
-        print('Usage: storetle info <file.storetle>')
+        print('Usage: storetle info <file-or-url>')
         sys.exit(1)
-    from .stream import StreamReader
     def fmt(n):
         if n < 1048576: return f'{n/1024:.1f}KB'
         return f'{n/1048576:.2f}MB'
-    info = StreamReader.info(args[0])
+    if _is_url(args[0]):
+        from .remote import RemoteReader
+        with RemoteReader(args[0]) as r:
+            info = r.info()
+    else:
+        from .stream import StreamReader
+        info = StreamReader.info(args[0])
     print(f'\n  {args[0]}')
     print(f'  Documents:    {info["docs"]:,}')
     print(f'  Chunks:       {info["chunks"]:,}')
@@ -93,15 +114,18 @@ def cmd_info(args):
 def cmd_get(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle get <file.storetle> <index>')
+        print('Usage: storetle get <file-or-url> <index> [--text]')
         sys.exit(1)
-    from .stream import StreamReader
-    with StreamReader(args[0]) as r:
+    with _open_reader(args[0]) as r:
         try:
             idx = int(args[1])
-            doc = r[idx]
+            doc = r.get_text(idx) if text else r[idx]
             sys.stdout.buffer.write(doc)
+            if text:
+                sys.stdout.buffer.write(b'\n')
         except (IndexError, ValueError) as e:
             print(f'Error: {e}')
             sys.exit(1)
@@ -254,9 +278,11 @@ HELP = """storetle — HTML-aware compression for large document collections
 Commands:
   bench     <folder>                   Benchmark your HTML data vs gzip WARC
   pack      <folder> <output>          Compress a folder → .storetle file
-  unpack    <file> <output_folder>     Extract a .storetle → HTML files
-  info      <file>                     Show file statistics
-  get       <file> <index>             Extract one document by index (0-based)
+  unpack    <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
+  info      <file-or-url>               Show file statistics
+  get       <file-or-url> <index>      Extract one doc by index — over HTTP this
+                                       fetches only the containing ~2MB chunk.
+                                       Add --text for tag-stripped plain text
   from-warc   <input.warc[.gz]> <out>  Convert WARC → .storetle
   to-warc     <input.storetle> <out>  Convert .storetle → WARC (or .warc.gz)
   warc-encode <input.warc[.gz]> <out> Encode HTML in-place → valid .warc.gz (smaller, standard format)
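
The argument handling added to `cmd_get` and `cmd_unpack` in this file can be sketched in isolation. This is a minimal illustration under stated assumptions, not the package's code: `split_text_flag` and `reader_kind` are hypothetical helper names, and `reader_kind` only reports which reader class would be chosen instead of importing storetle.

```python
def is_url(s):
    # Same test as _is_url in the diff: only http:// and https:// count as remote.
    return s.startswith(('http://', 'https://'))


def split_text_flag(args):
    # Hypothetical helper mirroring the two lines added to cmd_get/cmd_unpack:
    # detect --text anywhere in the argument list, then drop it so the
    # remaining positional arguments keep their usual order.
    text = '--text' in args
    positional = [a for a in args if a != '--text']
    return text, positional


def reader_kind(src):
    # Hypothetical stand-in for _open_reader: returns the class name as a
    # string rather than constructing a reader.
    return 'RemoteReader' if is_url(src) else 'StreamReader'


text, positional = split_text_flag(['--text', 'https://host/corpus.storetle', '42'])
print(text, positional, reader_kind(positional[0]))
```

Because the flag is removed before the length check, `--text` can appear before, between, or after the positional arguments without changing which one is treated as the source or the index.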

storetle-0.2.2/storetle/remote.py ADDED

@@ -0,0 +1,190 @@
+# remote.py — read .storetle archives over HTTP(S) with Range requests.
+#
+# Opens an archive with at most three small requests (footer+index tail,
+# header, dictionary if embedded), then fetches exactly one chunk span
+# (≤ ~2 MB compressed) per document access. Works against any server or
+# object store that honors Range (S3, R2, GitHub Pages, nginx, ...).
+#
+# Stdlib only — urllib, struct.
+import struct
+import urllib.request
+from pathlib import Path
+from .stream import STREAM_MAGIC, STREAM_VERSION, _decompress, _decode_doc
+_DEFAULT_DICT_PATH = Path(__file__).parent / 'cube_dict_v10.bin'
+# One speculative tail fetch usually captures index + footer in a single
+# round trip (index entries are 14 bytes; 64 KB covers ~4,600 chunks ≈
+# 1.2M documents).
+_TAIL_BYTES = 64 * 1024
+class RemoteReader:
+    """Random-access reader for a .storetle file served over HTTP(S).
+    Usage:
+        with RemoteReader('https://host/corpus.storetle') as r:
+            print(r.doc_count)
+            html = r[42]
+            for doc in r:
+                ...
+    """
+    def __init__(self, url, dictionary=None, timeout=30):
+        self._url = url
+        self._timeout = timeout
+        self._chunk_cache = (None, None)   # (chunk_idx, [decoded raw docs])
+        self.bytes_fetched = 0
+        tail = self._fetch_suffix(_TAIL_BYTES)
+        if len(tail) < 16:
+            raise ValueError('File too small to be a .storetle archive')
+        chunk_count, index_offset = struct.unpack('>QQ', tail[-16:])
+        index_size = chunk_count * 14
+        if index_size + 16 <= len(tail):
+            index_raw = tail[-(index_size + 16):-16]
+        else:
+            index_raw = self._fetch(index_offset, index_offset + index_size - 1)
+        self._index = []
+        for i in range(chunk_count):
+            off, dc, orig = struct.unpack_from('>QHI', index_raw, i * 14)
+            self._index.append((off, dc, orig))
+        # chunk i occupies [offset_i, offset_{i+1}); the last ends at the index
+        self._chunk_ends = [self._index[i + 1][0] for i in range(chunk_count - 1)]
+        self._chunk_ends.append(index_offset)
+        # cumulative doc counts for index lookup
+        self._cum = [0]
+        for _, dc, _ in self._index:
+            self._cum.append(self._cum[-1] + dc)
+        self.doc_count = self._cum[-1]
+        head = self._fetch(0, 8)
+        if head[:4] != STREAM_MAGIC:
+            raise ValueError('Not a .storetle file (magic: %r)' % head[:4])
+        if head[4] != STREAM_VERSION:
+            raise ValueError('Unsupported version %d (reader is v%d)'
+                             % (head[4], STREAM_VERSION))
+        dict_size = struct.unpack('>I', head[5:9])[0]
+        if dictionary is not None:
+            self._dict = dictionary
+        elif dict_size:
+            self._dict = self._fetch(9, 9 + dict_size - 1)
+        else:
+            self._dict = _DEFAULT_DICT_PATH.read_bytes() \
+                if _DEFAULT_DICT_PATH.exists() else b''
+    # -- HTTP plumbing ------------------------------------------------------
+    def _fetch(self, start, end):
+        return self._range_request('bytes=%d-%d' % (start, end))
+    def _fetch_suffix(self, n):
+        return self._range_request('bytes=-%d' % n)
+    def _range_request(self, range_header):
+        req = urllib.request.Request(self._url, headers={
+            'Range': range_header,
+            'User-Agent': 'storetle-remote/0.2.1',
+        })
+        with urllib.request.urlopen(req, timeout=self._timeout) as resp:
+            if resp.status not in (200, 206):
+                raise IOError('HTTP %d for %s' % (resp.status, self._url))
+            if resp.status == 200 and range_header != 'bytes=0-':
+                raise IOError(
+                    'Server ignored Range request — remote access needs a '
+                    'server that supports HTTP Range (got full response)')
+            data = resp.read()
+        self.bytes_fetched += len(data)
+        return data
+    # -- document access ----------------------------------------------------
+    def _load_chunk(self, ci):
+        if self._chunk_cache[0] == ci:
+            return self._chunk_cache[1]
+        off, expect_dc, _ = self._index[ci]
+        raw = self._fetch(off, self._chunk_ends[ci] - 1)
+        dc, _orig, comp_size = struct.unpack_from('>HII', raw, 0)
+        if dc != expect_dc:
+            raise ValueError('Chunk %d header disagrees with index' % ci)
+        sizes = struct.unpack_from('>%dI' % dc, raw, 10)
+        blob = _decompress(raw[10 + dc * 4: 10 + dc * 4 + comp_size], self._dict)
+        docs, pos = [], 0
+        for s in sizes:
+            docs.append(blob[pos:pos + s])
+            pos += s
+        self._chunk_cache = (ci, docs)
+        return docs
+    def _locate(self, idx):
+        lo, hi = 0, len(self._index) - 1
+        while lo < hi:
+            mid = (lo + hi) // 2
+            if self._cum[mid + 1] <= idx:
+                lo = mid + 1
+            else:
+                hi = mid
+        return lo
+    def __len__(self):
+        return self.doc_count
+    def __getitem__(self, idx):
+        if isinstance(idx, slice):
+            return [self[i] for i in range(*idx.indices(self.doc_count))]
+        if idx < 0:
+            idx += self.doc_count
+        if not 0 <= idx < self.doc_count:
+            raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+        ci = self._locate(idx)
+        docs = self._load_chunk(ci)
+        return _decode_doc(docs[idx - self._cum[ci]])
+    def __iter__(self):
+        for ci in range(len(self._index)):
+            for raw in self._load_chunk(ci):
+                yield _decode_doc(raw)
+    def get_text(self, idx):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        if idx < 0:
+            idx += self.doc_count
+        if not 0 <= idx < self.doc_count:
+            raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+        ci = self._locate(idx)
+        return decode_text(self._load_chunk(ci)[idx - self._cum[ci]])
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._load_chunk(ci):
+                yield decode_text(raw)
+    def info(self):
+        comp = self._chunk_ends[-1] - self._index[0][0] if self._index else 0
+        return {
+            'docs': self.doc_count,
+            'chunks': len(self._index),
+            'original_bytes': sum(orig for _, _, orig in self._index),
+            'compressed_bytes': comp,
+            'ratio_pct': round(100 * (1 - comp / max(1, sum(o for _, _, o in self._index))), 2),
+        }
+    def close(self):
+        self._chunk_cache = (None, None)
+    def __enter__(self):
+        return self
+    def __exit__(self, *exc):
+        self.close()
+        return False
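
The footer and index layout that `RemoteReader.__init__` reads can be tried out without a network. The sketch below builds a synthetic tail from the struct formats shown in the diff (`>QQ` footer, 14-byte `>QHI` index entries) and maps a document index to a chunk. The function names and the sample offsets are made up for illustration, and the lookup uses `bisect_right` from the standard library in place of the hand-written binary search in `_locate`.

```python
import struct
from bisect import bisect_right


def parse_tail(tail):
    # Footer: the last 16 bytes hold chunk_count and index_offset (big-endian u64 each).
    chunk_count, index_offset = struct.unpack('>QQ', tail[-16:])
    index_size = chunk_count * 14
    if index_size + 16 > len(tail):
        # The real reader issues a second Range request here; this sketch just stops.
        raise ValueError('index is not fully contained in the tail')
    index_raw = tail[-(index_size + 16):-16]
    # Each entry: chunk offset (u64), document count (u16), original size (u32).
    index = [struct.unpack_from('>QHI', index_raw, i * 14) for i in range(chunk_count)]
    return index, index_offset


def chunk_ranges(index, index_offset):
    # Chunk i spans [offset_i, offset_{i+1}); the last chunk ends where the index starts.
    ends = [index[i + 1][0] for i in range(len(index) - 1)] + [index_offset]
    return [(off, end - 1) for (off, _dc, _orig), end in zip(index, ends)]


def cumulative_counts(index):
    cum = [0]
    for _off, doc_count, _orig in index:
        cum.append(cum[-1] + doc_count)
    return cum


def locate(cum, idx):
    # Chunk ci holds documents cum[ci] .. cum[ci + 1] - 1.
    ci = bisect_right(cum, idx) - 1
    return ci, idx - cum[ci]


# Synthetic archive tail: three chunks with 3, 2 and 4 documents (made-up values).
entries = [(9, 3, 1000), (500, 2, 800), (900, 4, 1200)]
tail = b''.join(struct.pack('>QHI', *e) for e in entries) + struct.pack('>QQ', 3, 1400)

index, index_offset = parse_tail(tail)
cum = cumulative_counts(index)
print(chunk_ranges(index, index_offset))
print(locate(cum, 4))
```

Each pair from `chunk_ranges` is an inclusive byte range, which is the form a `bytes=start-end` Range header expects.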

{storetle-0.2.0 → storetle-0.2.2}/storetle/stream.py RENAMED

@@ -426,6 +426,19 @@ class StreamReader:
         raw = self._read_chunk(ci)[wi]
         return _decode_doc(raw)
+    def get_text(self, doc_idx: int):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        ci, wi = self._locate(doc_idx)
+        return decode_text(self._read_chunk(ci)[wi])
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._read_chunk(ci):
+                yield decode_text(raw)
     def __getitem__(self, key):
         if isinstance(key, int):
             return self.get(key)

storetle-0.2.2/storetle/text.py ADDED

@@ -0,0 +1,78 @@
+# text.py — plain-text extraction straight from the NodeOp encoding.
+#
+# The encoded form already separates structure (struct stream) from content
+# (content stream), so producing clean text never re-parses HTML: walk the
+# opcodes, keep T_TEXT payloads, skip script/style bodies (T_RAWTEXT),
+# comments and doctypes, and emit newlines at block-element boundaries.
+#
+# This consumes the content stream in exact lockstep with stream._decode_doc —
+# every string the HTML decoder would read, this reads too, it just throws
+# most of them away.
+import re
+import struct
+from .decoder import _Stream
+from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
+                      T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
+from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+_BLOCK_TAGS = frozenset((
+    'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
+    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+    'table', 'tr', 'caption', 'thead', 'tbody',
+    'section', 'article', 'aside', 'header', 'footer', 'main', 'nav',
+    'blockquote', 'pre', 'figure', 'figcaption', 'hr', 'title',
+))
+_CELL_TAGS = frozenset(('td', 'th'))
+_collapse_blank = re.compile(r'\n\s*\n+')
+_collapse_space = re.compile(r'[ \t\f\v]+')
+def decode_text(raw: bytes) -> bytes:
+    """Extract plain text from one encoded document (the blob stored in a
+    chunk), without reconstructing HTML."""
+    ss_len  = struct.unpack_from('>I', raw, 0)[0]
+    ss      = _Stream(raw[4: 4 + ss_len])
+    cs      = _Stream(raw[4 + ss_len:])
+    ss_data_len = ss_len
+    out = []
+    def boundary(tag):
+        if tag in _BLOCK_TAGS:
+            out.append('\n')
+        elif tag in _CELL_TAGS:
+            out.append('\t')
+    while ss._pos < ss_data_len:
+        nt = ss.read_byte()
+        if nt in (T_OPEN, T_SELFCLOSE):
+            tag_id = ss.read_byte()
+            tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
+                  else ID_TO_TAG.get(tag_id, '')
+            ac = ss.read_byte()
+            for _ in range(ac):
+                aid = ss.read_byte()
+                if aid == UNKNOWN_ID:
+                    cs.read_string(SHARED_STRINGS)   # attr name — discard
+                cs.read_string(SHARED_STRINGS)       # attr value — discard
+            boundary(tag)
+        elif nt == T_CLOSE:
+            pass  # no payload; block boundary handled at open
+        elif nt == T_TEXT:
+            t = cs.read_string(SHARED_STRINGS)
+            if t:
+                out.append(t)
+        elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
+            cs.read_string(SHARED_STRINGS)           # script/style/meta — discard
+    text = ''.join(out)
+    text = _collapse_space.sub(' ', text)
+    text = _collapse_blank.sub('\n', text)
+    return text.strip().encode('utf-8')
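
The whitespace clean-up at the end of `decode_text` can be exercised on its own. The sketch below copies the two regular expressions from the diff and applies them in the same order; `normalize` is a hypothetical wrapper name, and the input list is a made-up stand-in for the `out` list that the opcode walk would produce.

```python
import re

# Same patterns as in text.py.
_collapse_blank = re.compile(r'\n\s*\n+')
_collapse_space = re.compile(r'[ \t\f\v]+')


def normalize(pieces):
    # Join the collected fragments, squeeze horizontal whitespace runs to one
    # space, squeeze runs of blank lines to one newline, then trim the ends.
    text = ''.join(pieces)
    text = _collapse_space.sub(' ', text)
    text = _collapse_blank.sub('\n', text)
    return text.strip()


# Made-up fragments: '\n' marks a block boundary, '\t' a table-cell boundary.
pieces = ['\n', 'Title', '\n', '\n', 'First   para', '\n', '\t', 'cell A', '\t', 'cell B']
print(repr(normalize(pieces)))
```

One consequence worth noting, assuming the code runs as shown in the diff: `_collapse_space` includes `\t`, so the tab appended for `td`/`th` boundaries appears to be folded into a single space before the text is returned.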

{storetle-0.2.0 → storetle-0.2.2/storetle.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.0
+Version: 0.2.2
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -96,6 +96,53 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
+## Remote archives (v0.2.1)
+`get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+of Range requests; fetching a document downloads only its ~2MB chunk — no
+server-side code, works against any Range-capable host (R2, S3, GitHub
+Pages, nginx):
+```bash
+storetle info https://adventurelands.github.io/storetle/sample.storetle
+storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+```
+```python
+from storetle import RemoteReader
+with RemoteReader('https://host/corpus.storetle') as r:
+    html = r[42]          # one ~2MB range request
+```
 ## Python API
 ```python

{storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/SOURCES.txt RENAMED

@@ -8,7 +8,9 @@ storetle/cube_dict_v10.bin
 storetle/decoder.py
 storetle/encoder.py
 storetle/folder.py
+storetle/remote.py
 storetle/stream.py
+storetle/text.py
 storetle/vocab.py
 storetle/warc.py
 storetle/zstd_compat.py