PyPI: storetle 0.2.2.tar.gz → 0.3.1.tar.gz


Files changed (25)

{storetle-0.2.2/storetle.egg-info → storetle-0.3.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.2
+Version: 0.3.1
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -98,23 +98,29 @@ storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ## Hosted corpora — free
-**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
-in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
-JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora                                  # list what's available
+storetle get wiki "Albert Einstein" --text        # one article, by name, ~2s
+storetle get wiki-text "Black hole"               # from the clean-text edition
 ```
-Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.
-```bash
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
-```
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
-Find a title's index by grepping the shard's `.index.jsonl`. More corpora
-(arXiv, PubMed Central OA) coming.
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.
 ## Plain text extraction (v0.2.2)
@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):
 ```bash
-storetle info https://adventurelands.github.io/storetle/sample.storetle
-storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get  wiki "Albert Einstein" --text
 ```
 ```python

{storetle-0.2.2 → storetle-0.3.1}/README.md RENAMED

@@ -75,23 +75,29 @@ storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ## Hosted corpora — free
-**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
-in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
-JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora                                  # list what's available
+storetle get wiki "Albert Einstein" --text        # one article, by name, ~2s
+storetle get wiki-text "Black hole"               # from the clean-text edition
 ```
-Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.
-```bash
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
-```
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
-Find a title's index by grepping the shard's `.index.jsonl`. More corpora
-(arXiv, PubMed Central OA) coming.
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.
 ## Plain text extraction (v0.2.2)
@@ -110,8 +116,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):
 ```bash
-storetle info https://adventurelands.github.io/storetle/sample.storetle
-storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get  wiki "Albert Einstein" --text
 ```
 ```python

{storetle-0.2.2 → storetle-0.3.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "storetle"
-version = "0.2.2"
+version = "0.3.1"
 description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
 readme = "README.md"
 requires-python = ">=3.8"

{storetle-0.2.2 → storetle-0.3.1}/storetle/__init__.py RENAMED

@@ -11,7 +11,7 @@ from .stream import StreamWriter, StreamReader
 from .remote import RemoteReader
 from .folder import pack, unpack
-__version__ = '0.2.2'
+__version__ = '0.3.1'
 __all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']

{storetle-0.2.2 → storetle-0.3.1}/storetle/cli.py RENAMED

@@ -117,11 +117,23 @@ def cmd_get(args):
     text = '--text' in args
     args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle get <file-or-url> <index> [--text]')
+        print('Usage: storetle get <file|url|corpus> <index|title> [--text]\n'
+              '       storetle get wiki "Albert Einstein" --text')
         sys.exit(1)
-    with _open_reader(args[0]) as r:
+    src, ref = args[0], ' '.join(args[1:])
+    if not _is_url(src) and not Path(src).exists():
+        # treat as a named corpus from the public registry
+        from .registry import resolve
+        try:
+            src, ref = resolve(src, ref)
+        except (KeyError, IndexError) as e:
+            print(f'Error: {e}')
+            sys.exit(1)
+    with _open_reader(src) as r:
         try:
-            idx = int(args[1])
+            idx = int(ref)
             doc = r.get_text(idx) if text else r[idx]
             sys.stdout.buffer.write(doc)
             if text:
@@ -131,6 +143,17 @@ def cmd_get(args):
             sys.exit(1)
+def cmd_corpora(args):
+    from .registry import list_corpora
+    print()
+    for name, info in list_corpora().items():
+        print(f'  {name:12s} {info.get("title","")}  '
+              f'[{info.get("docs","?"):,} docs, {info.get("snapshot","")}, '
+              f'{info.get("license","")}]')
+    print('\n  Usage: storetle get <corpus> <title-or-index> [--text]')
+    print()
 def cmd_from_warc(args):
     if len(args) < 2:
         print('Usage: storetle from-warc <input.warc[.gz]> <output.storetle>')
@@ -262,6 +285,7 @@ def cmd_warc_decode(args):
 COMMANDS = {
     'bench':       cmd_bench,
+    'corpora':     cmd_corpora,
     'pack':        cmd_pack,
     'unpack':      cmd_unpack,
     'info':        cmd_info,
@@ -276,11 +300,13 @@ COMMANDS = {
 HELP = """storetle — HTML-aware compression for large document collections
 Commands:
+  corpora                              List free hosted corpora
   bench     <folder>                   Benchmark your HTML data vs gzip WARC
   pack      <folder> <output>          Compress a folder → .storetle file
   unpack    <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
   info      <file-or-url>               Show file statistics
-  get       <file-or-url> <index>      Extract one doc by index — over HTTP this
+  get       <file|url|corpus> <ref>    Extract one doc by index or title — remote
+                                       reads fetch only the containing ~2MB chunk.
                                        fetches only the containing ~2MB chunk.
                                        Add --text for tag-stripped plain text
   from-warc   <input.warc[.gz]> <out>  Convert WARC → .storetle
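
One detail in the new `cmd_corpora` deserves a note: the `:,` format specifier only applies to numbers, so the `"?"` fallback used when a registry entry has no `docs` field would raise `ValueError` rather than print a placeholder. The sketch below shows one possible guard. `format_corpus_line` is a hypothetical helper, not part of storetle, and the sample entry simply reuses the figures quoted in the README.

```python
def format_corpus_line(name, info):
    """Render one registry entry, tolerating a missing or non-integer doc count.

    Hypothetical helper for illustration; not part of the storetle package.
    """
    docs = info.get("docs")
    # Apply the thousands separator only when the value really is an integer.
    docs_str = f"{docs:,}" if isinstance(docs, int) else "?"
    return (f"  {name:12s} {info.get('title', '')}  "
            f"[{docs_str} docs, {info.get('snapshot', '')}, "
            f"{info.get('license', '')}]")


# Sample entry built from the numbers quoted in the README.
sample = {"title": "Simple English Wikipedia", "docs": 267503,
          "snapshot": "2025-03-20", "license": "CC-BY-SA-4.0"}
print(format_corpus_line("wiki", sample))
print(format_corpus_line("incomplete", {}))   # no ValueError, prints "? docs"
```

Whether the published registry can ever omit `docs` is not visible from this diff, so the guard may be purely defensive.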

storetle-0.3.1/storetle/registry.py ADDED

@@ -0,0 +1,104 @@
+# registry.py — named corpora: `storetle get wiki "Albert Einstein"`.
+#
+# A corpus registry (corpora.json) lives next to the hosted data, so new
+# corpora appear without a new package release. Title→location maps are
+# fetched once and cached under ~/.cache/storetle/.
+import gzip
+import json
+import time
+import urllib.request
+from pathlib import Path
+REGISTRY_URL = 'https://data.davisbrief.com/corpora.json'
+CACHE_DIR = Path.home() / '.cache' / 'storetle'
+_REGISTRY_TTL = 3600
+def _fetch(url, timeout=30):
+    req = urllib.request.Request(url, headers={'User-Agent': 'storetle-cli'})
+    with urllib.request.urlopen(req, timeout=timeout) as r:
+        return r.read()
+def load_registry():
+    """Fetch corpora.json, with a small on-disk cache."""
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    cache = CACHE_DIR / 'corpora.json'
+    if cache.exists() and time.time() - cache.stat().st_mtime < _REGISTRY_TTL:
+        return json.loads(cache.read_text())
+    try:
+        data = _fetch(REGISTRY_URL)
+        cache.write_bytes(data)
+        return json.loads(data)
+    except Exception:
+        if cache.exists():               # stale beats broken
+            return json.loads(cache.read_text())
+        raise
+def _titles_path(corpus_name, entry):
+    """Download (once) and cache the corpus title map."""
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    local = CACHE_DIR / f'titles-{corpus_name}.tsv.gz'
+    if not local.exists():
+        url = entry['base'].rstrip('/') + '/' + entry['titles']
+        print(f'[storetle] fetching title index for "{corpus_name}" '
+              f'({url.rsplit("/",1)[-1]}, one-time)...')
+        local.write_bytes(_fetch(url))
+    return local
+def _lookup_title(corpus_name, entry, title):
+    """Resolve a title to (shard_no, doc_idx). Exact, then case-insensitive."""
+    want = title.strip()
+    want_ci = want.lower()
+    ci_hit = None
+    with gzip.open(_titles_path(corpus_name, entry), 'rt') as f:
+        for line in f:
+            name, shard, idx = line.rstrip('\n').rsplit('\t', 2)
+            if name == want:
+                return int(shard), int(idx)
+            if ci_hit is None and name.lower() == want_ci:
+                ci_hit = (int(shard), int(idx))
+    if ci_hit:
+        return ci_hit
+    raise KeyError(f'title not found in corpus "{corpus_name}": {title!r}')
+def resolve(corpus_name, ref):
+    """Resolve (corpus, index-or-title) → (shard_url, doc_idx).
+    ref may be an integer global index or a document title.
+    """
+    reg = load_registry()
+    if corpus_name not in reg:
+        raise KeyError(f'unknown corpus {corpus_name!r}; '
+                       f'available: {", ".join(sorted(reg))}')
+    entry = reg[corpus_name]
+    base = entry['base'].rstrip('/')
+    shards = entry['shards']
+    try:
+        gidx = int(ref)
+    except (TypeError, ValueError):
+        shard_no, idx = _lookup_title(corpus_name, entry, ref)
+        return f'{base}/{shards[shard_no]}', idx
+    # integer: map global index onto shards via per-shard doc counts
+    counts = entry.get('shard_docs') or []
+    if not counts:
+        return f'{base}/{shards[0]}', gidx
+    run = 0
+    for shard_no, n in enumerate(counts):
+        if gidx < run + n:
+            return f'{base}/{shards[shard_no]}', gidx - run
+        run += n
+    raise IndexError(f'index {gidx} out of range ({run} docs in corpus)')
+def list_corpora():
+    reg = load_registry()
+    return {name: {k: v for k, v in e.items() if k in
+                   ('title', 'docs', 'snapshot', 'license')}
+            for name, e in sorted(reg.items())}
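
The integer branch of `resolve()` walks per-shard document counts to turn a global index into a shard number plus a local index. A minimal sketch of the same mapping, using prefix sums and `bisect` instead of the linear scan, is shown below. `locate` and the example counts are hypothetical and not part of storetle. Two behaviours differ on purpose and are assumptions of this sketch: negative indices are rejected (the loop in `resolve()` would hand them to the first shard unchanged), and an empty count list is treated as an error (where `resolve()` falls back to the first shard).

```python
from bisect import bisect_right
from itertools import accumulate


def locate(shard_docs, gidx):
    """Map a global document index to (shard_no, local_idx).

    shard_docs is a list of per-shard document counts, like the registry's
    `shard_docs` field. Illustrative helper only, not storetle API.
    """
    if gidx < 0:
        # Assumption of this sketch: negative indices are not meaningful.
        raise IndexError(f'index {gidx} is negative')
    ends = list(accumulate(shard_docs))   # exclusive end offset of each shard
    shard_no = bisect_right(ends, gidx)   # first shard whose end exceeds gidx
    if shard_no == len(ends):
        total = ends[-1] if ends else 0
        raise IndexError(f'index {gidx} out of range ({total} docs in corpus)')
    start = ends[shard_no - 1] if shard_no else 0
    return shard_no, gidx - start


# Hypothetical corpus with three shards holding 3, 2 and 4 documents.
counts = [3, 2, 4]
for g in (0, 2, 3, 8):
    print(g, '->', locate(counts, g))
```

For a handful of shards the linear scan in the diff is just as good; the bisect form only pays off if a corpus ever has many shards.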

{storetle-0.2.2 → storetle-0.3.1}/storetle/stream.py RENAMED

@@ -84,6 +84,18 @@ def _encode_doc(html):
     return struct.pack('>I', len(ss)) + ss + cs
+def _encode_text_doc(text):
+    """Encode a plain-text document as a single text node.
+    Same container, same decoders: HTML readers see the text (escaped),
+    --text / get_text() return it verbatim. This is how text-mode corpora
+    (e.g. clean-text Wikipedia) are stored without any format change."""
+    if isinstance(text, bytes):
+        text = text.decode('utf-8', errors='replace')
+    ss, cs = _build_streams_class_split([(T_TEXT, None, text)])
+    return struct.pack('>I', len(ss)) + ss + cs
 def _decode_doc(raw):
     """Decode a v2 blob back to reconstructed HTML bytes."""
     from .decoder import _Stream
@@ -229,6 +241,23 @@ class StreamWriter:
         else:
             self._append_sync(html)
+    def append_text(self, text):
+        """Encode and buffer one plain-text document (no HTML parsing)."""
+        if self._workers > 1:
+            # preserve document order: settle in-flight HTML encodes first
+            self._drain_all()
+        if isinstance(text, str):
+            data = text.encode('utf-8', errors='replace')
+        else:
+            data = text
+        self._total_orig += len(data)
+        raw = _encode_text_doc(data)
+        self._chunk_buf.append(raw)
+        self._chunk_bytes += len(raw)
+        self._total_docs  += 1
+        if len(self._chunk_buf) >= CHUNK_DOCS or self._chunk_bytes >= CHUNK_BYTES:
+            self._flush_chunk()
     def _append_sync(self, html):
         if isinstance(html, str):
             try:

{storetle-0.2.2 → storetle-0.3.1}/storetle/text.py RENAMED

@@ -15,7 +15,7 @@ import struct
 from .decoder import _Stream
 from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
                       T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
-from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+from .vocab import ID_TO_TAG, ID_TO_ATTR, SHARED_STRINGS, UNKNOWN_ID
 _BLOCK_TAGS = frozenset((
     'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
@@ -29,6 +29,25 @@ _CELL_TAGS = frozenset(('td', 'th'))
 _collapse_blank = re.compile(r'\n\s*\n+')
 _collapse_space = re.compile(r'[ \t\f\v]+')
+# Elements whose entire subtree is navigation/boilerplate, not content.
+# Matched against class tokens (substring) and role attribute values.
+_SKIP_CLASS_SUBSTR = ('navbox', 'catlinks', 'mw-jump', 'printfooter',
+                      'mw-editsection', 'breadcrumb', 'site-nav')
+_SKIP_ROLES = ('navigation',)
+def _is_boilerplate(attrs):
+    for name, value in attrs:
+        if not value:
+            continue
+        if name == 'role' and value.lower() in _SKIP_ROLES:
+            return True
+        if name == 'class':
+            v = value.lower()
+            if any(s in v for s in _SKIP_CLASS_SUBSTR):
+                return True
+    return False
 def decode_text(raw: bytes) -> bytes:
     """Extract plain text from one encoded document (the blob stored in a
@@ -39,6 +58,8 @@ def decode_text(raw: bytes) -> bytes:
     ss_data_len = ss_len
     out = []
+    depth = 0
+    skip_above = None     # while set, drop text until depth returns here
     def boundary(tag):
         if tag in _BLOCK_TAGS:
@@ -54,19 +75,27 @@ def decode_text(raw: bytes) -> bytes:
             tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
                   else ID_TO_TAG.get(tag_id, '')
             ac = ss.read_byte()
+            attrs = []
             for _ in range(ac):
                 aid = ss.read_byte()
-                if aid == UNKNOWN_ID:
-                    cs.read_string(SHARED_STRINGS)   # attr name — discard
-                cs.read_string(SHARED_STRINGS)       # attr value — discard
-            boundary(tag)
+                aname = cs.read_string(SHARED_STRINGS) if aid == UNKNOWN_ID \
+                        else ID_TO_ATTR.get(aid, '')
+                attrs.append((aname, cs.read_string(SHARED_STRINGS)))
+            if nt == T_OPEN:
+                depth += 1
+                if skip_above is None and _is_boilerplate(attrs):
+                    skip_above = depth - 1
+            if skip_above is None:
+                boundary(tag)
         elif nt == T_CLOSE:
-            pass  # no payload; block boundary handled at open
+            depth = max(0, depth - 1)
+            if skip_above is not None and depth <= skip_above:
+                skip_above = None
         elif nt == T_TEXT:
             t = cs.read_string(SHARED_STRINGS)
-            if t:
+            if t and skip_above is None:
                 out.append(t)
         elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
@@ -74,5 +103,5 @@ def decode_text(raw: bytes) -> bytes:
     text = ''.join(out)
     text = _collapse_space.sub(' ', text)
-    text = _collapse_blank.sub('\n', text)
+    text = _collapse_blank.sub('\n\n', text)   # keep paragraph boundaries
     return text.strip().encode('utf-8')
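
The `depth` / `skip_above` bookkeeping added to `decode_text` can be easier to follow outside the binary stream reader. The sketch below replays the same logic over a simplified token list. The `('open', tag, attrs)`, `('close',)` and `('text', s)` tuples are a hypothetical stand-in for storetle's encoded streams, and `visible_text` is not a storetle function. The class substrings and role values are copied from the diff.

```python
SKIP_CLASS_SUBSTR = ('navbox', 'catlinks', 'mw-jump', 'printfooter',
                     'mw-editsection', 'breadcrumb', 'site-nav')
SKIP_ROLES = ('navigation',)


def is_boilerplate(attrs):
    """True if any attribute marks the element as navigation/boilerplate."""
    for name, value in attrs:
        if not value:
            continue
        if name == 'role' and value.lower() in SKIP_ROLES:
            return True
        if name == 'class' and any(s in value.lower() for s in SKIP_CLASS_SUBSTR):
            return True
    return False


def visible_text(tokens):
    """Concatenate text nodes, dropping everything inside boilerplate subtrees."""
    depth = 0
    skip_above = None          # depth to return to before text is kept again
    out = []
    for tok in tokens:
        kind = tok[0]
        if kind == 'open':
            depth += 1
            if skip_above is None and is_boilerplate(tok[2]):
                skip_above = depth - 1
        elif kind == 'close':
            depth = max(0, depth - 1)
            if skip_above is not None and depth <= skip_above:
                skip_above = None
        elif kind == 'text' and skip_above is None:
            out.append(tok[1])
    return ''.join(out)


# Hypothetical document: a content div wrapping a navbox with nested markup.
tokens = [
    ('open', 'div', []),
    ('text', 'Intro. '),
    ('open', 'div', [('class', 'navbox vertical')]),
    ('text', 'nav'),
    ('open', 'span', []),
    ('text', 'inner'),
    ('close',),
    ('close',),
    ('text', 'Body.'),
    ('close',),
]
print(visible_text(tokens))
```

The scheme depends on opens and closes being balanced: in this model, an element opened inside a skipped subtree and never closed would keep `depth` above `skip_above` and extend the skip past the intended region.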

{storetle-0.2.2 → storetle-0.3.1/storetle.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.2
+Version: 0.3.1
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -98,23 +98,29 @@ storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ## Hosted corpora — free
-**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
-in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
-JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora                                  # list what's available
+storetle get wiki "Albert Einstein" --text        # one article, by name, ~2s
+storetle get wiki-text "Black hole"               # from the clean-text edition
 ```
-Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.
-```bash
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
-storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
-```
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
-Find a title's index by grepping the shard's `.index.jsonl`. More corpora
-(arXiv, PubMed Central OA) coming.
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.
 ## Plain text extraction (v0.2.2)
@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):
 ```bash
-storetle info https://adventurelands.github.io/storetle/sample.storetle
-storetle get  https://adventurelands.github.io/storetle/sample.storetle 4
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get  wiki "Albert Einstein" --text
 ```
 ```python

{storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/SOURCES.txt RENAMED

@@ -8,6 +8,7 @@ storetle/cube_dict_v10.bin
 storetle/decoder.py
 storetle/encoder.py
 storetle/folder.py
+storetle/registry.py
 storetle/remote.py
 storetle/stream.py
 storetle/text.py