PyPI - storetle - Versions diffs - 0.2.1__tar.gz → 0.2.2__tar.gz - Mend

storetle 0.2.1.tar.gz → 0.2.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

{storetle-0.2.1/storetle.egg-info → storetle-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.1
+Version: 0.2.2
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -96,6 +96,35 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
 ## Remote archives (v0.2.1)
 `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB

{storetle-0.2.1 → storetle-0.2.2}/README.md RENAMED

@@ -73,6 +73,35 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
 ## Remote archives (v0.2.1)
 `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
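
Note on the "grepping the shard's `.index.jsonl`" step in the README hunk above: the diff does not show the index schema, so the following is only a sketch. The field names `title` and `index` are assumptions, passed as parameters so they can be swapped for whatever the real index uses.

```python
import json

def find_index(lines, wanted_title, title_key="title", index_key="index"):
    """Return the doc index for wanted_title, or None if it is absent.

    `lines` is any iterable of JSONL strings (an open file works).
    title_key / index_key are hypothetical field names, not confirmed
    by the storetle diff.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the index file
        record = json.loads(line)
        if record.get(title_key) == wanted_title:
            return record.get(index_key)
    return None

sample = [
    '{"title": "Isaac Newton", "index": 11243}',
    '{"title": "Albert Einstein", "index": 11244}',
]
print(find_index(sample, "Albert Einstein"))
```

The returned number is what would be passed as `<index>` to `storetle get`.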

{storetle-0.2.1 → storetle-0.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "storetle"
-version = "0.2.1"
+version = "0.2.2"
 description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
 readme = "README.md"
 requires-python = ">=3.8"

{storetle-0.2.1 → storetle-0.2.2}/storetle/__init__.py RENAMED

@@ -11,7 +11,7 @@ from .stream import StreamWriter, StreamReader
 from .remote import RemoteReader
 from .folder import pack, unpack
-__version__ = '0.2.1'
+__version__ = '0.2.2'
 __all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']

{storetle-0.2.1 → storetle-0.2.2}/storetle/cli.py RENAMED

@@ -68,17 +68,21 @@ def _open_reader(src):
 def cmd_unpack(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle unpack <file-or-url> <output_folder>')
+        print('Usage: storetle unpack <file-or-url> <output_folder> [--text]')
         sys.exit(1)
     src = args[0]
     dst = Path(args[1])
     dst.mkdir(parents=True, exist_ok=True)
+    ext = 'txt' if text else 'html'
     with _open_reader(src) as r:
-        print(f'Extracting {r.doc_count} documents to {dst}/')
-        for i, doc in enumerate(r):
-            out = dst / f'doc_{i:06d}.html'
+        print(f'Extracting {r.doc_count} documents to {dst}/ as .{ext}')
+        docs = r.iter_text() if text else iter(r)
+        for i, doc in enumerate(docs):
+            out = dst / f'doc_{i:06d}.{ext}'
             out.write_bytes(doc)
             if (i + 1) % 100 == 0:
                 print(f'  {i+1}/{r.doc_count}...')
@@ -110,14 +114,18 @@ def cmd_info(args):
 def cmd_get(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle get <file-or-url> <index>')
+        print('Usage: storetle get <file-or-url> <index> [--text]')
         sys.exit(1)
     with _open_reader(args[0]) as r:
         try:
             idx = int(args[1])
-            doc = r[idx]
+            doc = r.get_text(idx) if text else r[idx]
             sys.stdout.buffer.write(doc)
+            if text:
+                sys.stdout.buffer.write(b'\n')
         except (IndexError, ValueError) as e:
             print(f'Error: {e}')
             sys.exit(1)
@@ -270,10 +278,11 @@ HELP = """storetle — HTML-aware compression for large document collections
 Commands:
   bench     <folder>                   Benchmark your HTML data vs gzip WARC
   pack      <folder> <output>          Compress a folder → .storetle file
-  unpack    <file-or-url> <out_folder>  Extract a .storetle → HTML files
+  unpack    <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
   info      <file-or-url>               Show file statistics
   get       <file-or-url> <index>      Extract one doc by index — over HTTP this
-                                       fetches only the containing ~2MB chunk
+                                       fetches only the containing ~2MB chunk.
+                                       Add --text for tag-stripped plain text
   from-warc   <input.warc[.gz]> <out>  Convert WARC → .storetle
   to-warc     <input.storetle> <out>  Convert .storetle → WARC (or .warc.gz)
   warc-encode <input.warc[.gz]> <out> Encode HTML in-place → valid .warc.gz (smaller, standard format)
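
Note on the `--text` handling in `cmd_unpack` and `cmd_get` above: both hunks test for the flag and then filter it out of `args` before the positional checks, so the flag may appear in any position. A minimal standalone sketch of that pattern follows; `split_flag` is a hypothetical helper name, not a function in storetle.

```python
def split_flag(args, flag):
    """Return (flag_present, remaining_args) with every copy of flag removed.

    Mirrors the two lines added to cmd_unpack/cmd_get in the diff:
        text = '--text' in args
        args = [a for a in args if a != '--text']
    """
    present = flag in args
    rest = [a for a in args if a != flag]
    return present, rest

# The flag can sit before, between, or after the positional arguments.
print(split_flag(["archive.storetle", "--text", "11244"], "--text"))
print(split_flag(["archive.storetle", "11244"], "--text"))
```

One consequence of this approach is that a positional argument spelled exactly `--text` can never be passed, since every occurrence is stripped.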

{storetle-0.2.1 → storetle-0.2.2}/storetle/remote.py RENAMED

@@ -152,6 +152,23 @@ class RemoteReader:
             for raw in self._load_chunk(ci):
                 yield _decode_doc(raw)
+    def get_text(self, idx):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        if idx < 0:
+            idx += self.doc_count
+        if not 0 <= idx < self.doc_count:
+            raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+        ci = self._locate(idx)
+        return decode_text(self._load_chunk(ci)[idx - self._cum[ci]])
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._load_chunk(ci):
+                yield decode_text(raw)
     def info(self):
         comp = self._chunk_ends[-1] - self._index[0][0] if self._index else 0
         return {

{storetle-0.2.1 → storetle-0.2.2}/storetle/stream.py RENAMED

@@ -426,6 +426,19 @@ class StreamReader:
         raw = self._read_chunk(ci)[wi]
         return _decode_doc(raw)
+    def get_text(self, doc_idx: int):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        ci, wi = self._locate(doc_idx)
+        return decode_text(self._read_chunk(ci)[wi])
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._read_chunk(ci):
+                yield decode_text(raw)
     def __getitem__(self, key):
         if isinstance(key, int):
             return self.get(key)
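
Note on document lookup in the two `get_text` methods above: neither hunk shows `_locate`, so the sketch below is an assumption about how a document index could map to a (chunk, offset) pair. It relies only on what the remote.py hunk implies, namely that `_cum[ci]` is the index of the first document in chunk `ci` (the diff subtracts it from `idx`), and it reuses the negative-index handling and error message shown there. The function names are hypothetical.

```python
from bisect import bisect_right

def normalize_index(idx, doc_count):
    """Apply the same negative-index and range check as RemoteReader.get_text."""
    if idx < 0:
        idx += doc_count
    if not 0 <= idx < doc_count:
        raise IndexError('doc %d out of range (%d docs)' % (idx, doc_count))
    return idx

def locate(idx, cum):
    """Map a document index to (chunk_index, offset_within_chunk).

    cum[ci] is assumed to be the index of the first document in chunk ci,
    so cum is sorted and cum[0] == 0.
    """
    ci = bisect_right(cum, idx) - 1  # last chunk whose first doc is <= idx
    return ci, idx - cum[ci]

cum = [0, 3, 5]                      # chunks hold docs 0-2, 3-4, 5-7
idx = normalize_index(-1, 8)         # last document of eight
print(idx, locate(idx, cum))
```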

storetle-0.2.2/storetle/text.py ADDED

@@ -0,0 +1,78 @@
+# text.py — plain-text extraction straight from the NodeOp encoding.
+#
+# The encoded form already separates structure (struct stream) from content
+# (content stream), so producing clean text never re-parses HTML: walk the
+# opcodes, keep T_TEXT payloads, skip script/style bodies (T_RAWTEXT),
+# comments and doctypes, and emit newlines at block-element boundaries.
+#
+# This consumes the content stream in exact lockstep with stream._decode_doc —
+# every string the HTML decoder would read, this reads too, it just throws
+# most of them away.
+import re
+import struct
+from .decoder import _Stream
+from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
+                      T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
+from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+_BLOCK_TAGS = frozenset((
+    'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
+    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+    'table', 'tr', 'caption', 'thead', 'tbody',
+    'section', 'article', 'aside', 'header', 'footer', 'main', 'nav',
+    'blockquote', 'pre', 'figure', 'figcaption', 'hr', 'title',
+))
+_CELL_TAGS = frozenset(('td', 'th'))
+_collapse_blank = re.compile(r'\n\s*\n+')
+_collapse_space = re.compile(r'[ \t\f\v]+')
+def decode_text(raw: bytes) -> bytes:
+    """Extract plain text from one encoded document (the blob stored in a
+    chunk), without reconstructing HTML."""
+    ss_len  = struct.unpack_from('>I', raw, 0)[0]
+    ss      = _Stream(raw[4: 4 + ss_len])
+    cs      = _Stream(raw[4 + ss_len:])
+    ss_data_len = ss_len
+    out = []
+    def boundary(tag):
+        if tag in _BLOCK_TAGS:
+            out.append('\n')
+        elif tag in _CELL_TAGS:
+            out.append('\t')
+    while ss._pos < ss_data_len:
+        nt = ss.read_byte()
+        if nt in (T_OPEN, T_SELFCLOSE):
+            tag_id = ss.read_byte()
+            tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
+                  else ID_TO_TAG.get(tag_id, '')
+            ac = ss.read_byte()
+            for _ in range(ac):
+                aid = ss.read_byte()
+                if aid == UNKNOWN_ID:
+                    cs.read_string(SHARED_STRINGS)   # attr name — discard
+                cs.read_string(SHARED_STRINGS)       # attr value — discard
+            boundary(tag)
+        elif nt == T_CLOSE:
+            pass  # no payload; block boundary handled at open
+        elif nt == T_TEXT:
+            t = cs.read_string(SHARED_STRINGS)
+            if t:
+                out.append(t)
+        elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
+            cs.read_string(SHARED_STRINGS)           # script/style/meta — discard
+    text = ''.join(out)
+    text = _collapse_space.sub(' ', text)
+    text = _collapse_blank.sub('\n', text)
+    return text.strip().encode('utf-8')
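
Note on `decode_text` above: the idea is easier to follow on a toy model. The sketch below is not storetle's byte format. It swaps the binary struct/content streams for a list of tuples and a queue of strings, and the opcode names are hypothetical stand-ins for the `T_*` constants. It keeps the two properties the file comment stresses: every content string is consumed in order even when it is discarded, and whitespace is normalized with the same two regexes.

```python
import re
from collections import deque

# Hypothetical opcode names for this toy model only.
OPEN, CLOSE, TEXT, RAWTEXT = "open", "close", "text", "rawtext"
BLOCK_TAGS = frozenset(("p", "div", "h1", "li", "title"))
CELL_TAGS = frozenset(("td", "th"))

_collapse_space = re.compile(r"[ \t\f\v]+")   # same patterns as text.py
_collapse_blank = re.compile(r"\n\s*\n+")

def extract_text(struct_ops, content):
    """struct_ops: (opcode, tag, attr_count) tuples; content: strings in order."""
    cs = deque(content)
    out = []
    for op, tag, attr_count in struct_ops:
        if op == OPEN:
            for _ in range(attr_count):
                cs.popleft()          # attribute value: read, then discard
            if tag in BLOCK_TAGS:
                out.append("\n")
            elif tag in CELL_TAGS:
                out.append("\t")
        elif op == TEXT:
            out.append(cs.popleft())  # the only payload that is kept
        elif op == RAWTEXT:
            cs.popleft()              # script/style body: read, then discard
        # CLOSE carries no payload and emits nothing
    text = "".join(out)
    text = _collapse_space.sub(" ", text)
    text = _collapse_blank.sub("\n", text)
    return text.strip()

ops = [
    (OPEN, "div", 1), (OPEN, "h1", 0), (TEXT, None, 0), (CLOSE, None, 0),
    (OPEN, "script", 0), (RAWTEXT, None, 0), (CLOSE, None, 0),
    (OPEN, "p", 0), (TEXT, None, 0), (OPEN, "b", 0), (TEXT, None, 0),
    (CLOSE, None, 0), (CLOSE, None, 0), (CLOSE, None, 0),
]
content = ["main", "Albert Einstein", "var x = 1;", "He was a  ", "physicist"]
print(extract_text(ops, content))
```

One detail visible in the regexes: the tab emitted for `td`/`th` falls inside the `[ \t\f\v]+` class, so after normalization cell boundaries appear as single spaces rather than tabs.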

{storetle-0.2.1 → storetle-0.2.2/storetle.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.1
+Version: 0.2.2
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -96,6 +96,35 @@ storetle to-warc   archive.storetle out.warc.gz
 storetle train     my_corpus/ --output my.bin     # domain-specific dictionary
 ```
+## Hosted corpora — free
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244            # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text     # …as clean plain text
+```
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+## Plain text extraction (v0.2.2)
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
 ## Remote archives (v0.2.1)
 `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB

{storetle-0.2.1 → storetle-0.2.2}/storetle.egg-info/SOURCES.txt RENAMED

@@ -10,6 +10,7 @@ storetle/encoder.py
 storetle/folder.py
 storetle/remote.py
 storetle/stream.py
+storetle/text.py
 storetle/vocab.py
 storetle/warc.py
 storetle/zstd_compat.py