storetle 0.2.0__tar.gz → 0.2.2__tar.gz

Files changed (24)
  1. {storetle-0.2.0/storetle.egg-info → storetle-0.2.2}/PKG-INFO +48 -1
  2. {storetle-0.2.0 → storetle-0.2.2}/README.md +47 -0
  3. {storetle-0.2.0 → storetle-0.2.2}/pyproject.toml +1 -1
  4. {storetle-0.2.0 → storetle-0.2.2}/storetle/__init__.py +3 -2
  5. {storetle-0.2.0 → storetle-0.2.2}/storetle/cli.py +42 -16
  6. storetle-0.2.2/storetle/remote.py +190 -0
  7. {storetle-0.2.0 → storetle-0.2.2}/storetle/stream.py +13 -0
  8. storetle-0.2.2/storetle/text.py +78 -0
  9. {storetle-0.2.0 → storetle-0.2.2/storetle.egg-info}/PKG-INFO +48 -1
  10. {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/SOURCES.txt +2 -0
  11. {storetle-0.2.0 → storetle-0.2.2}/LICENSE +0 -0
  12. {storetle-0.2.0 → storetle-0.2.2}/setup.cfg +0 -0
  13. {storetle-0.2.0 → storetle-0.2.2}/storetle/brotli_compat.py +0 -0
  14. {storetle-0.2.0 → storetle-0.2.2}/storetle/cube_dict_v10.bin +0 -0
  15. {storetle-0.2.0 → storetle-0.2.2}/storetle/decoder.py +0 -0
  16. {storetle-0.2.0 → storetle-0.2.2}/storetle/encoder.py +0 -0
  17. {storetle-0.2.0 → storetle-0.2.2}/storetle/folder.py +0 -0
  18. {storetle-0.2.0 → storetle-0.2.2}/storetle/vocab.py +0 -0
  19. {storetle-0.2.0 → storetle-0.2.2}/storetle/warc.py +0 -0
  20. {storetle-0.2.0 → storetle-0.2.2}/storetle/zstd_compat.py +0 -0
  21. {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/dependency_links.txt +0 -0
  22. {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/entry_points.txt +0 -0
  23. {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/requires.txt +0 -0
  24. {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: storetle
- Version: 0.2.0
+ Version: 0.2.2
  Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
  Author-email: Davis Brief <davis@team8.co>
  License: MIT
@@ -96,6 +96,53 @@ storetle to-warc archive.storetle out.warc.gz
  storetle train my_corpus/ --output my.bin # domain-specific dictionary
  ```

+ ## Hosted corpora — free
+
+ **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+ in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+ JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+
+ ```
+ https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+ ```
+
+ Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+
+ ```bash
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
+ ```
+
+ Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+ (arXiv, PubMed Central OA) coming.
+
+ ## Plain text extraction (v0.2.2)
+
+ `--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+ extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+ already separates structure from content, so text extraction is a walk over
+ the structure opcodes that keeps text nodes, drops script/style bodies, and
+ emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+ of readable text.
+
+ ## Remote archives (v0.2.1)
+
+ `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+ of Range requests; fetching a document downloads only its ~2MB chunk — no
+ server-side code, works against any Range-capable host (R2, S3, GitHub
+ Pages, nginx):
+
+ ```bash
+ storetle info https://adventurelands.github.io/storetle/sample.storetle
+ storetle get https://adventurelands.github.io/storetle/sample.storetle 4
+ ```
+
+ ```python
+ from storetle import RemoteReader
+ with RemoteReader('https://host/corpus.storetle') as r:
+     html = r[42] # one ~2MB range request
+ ```
+
  ## Python API

  ```python
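Reviewer note: the "Remote archives" section added in this hunk relies on ordinary HTTP Range requests. As a rough illustration of that mechanism only (not the package's own code), the sketch below builds the two kinds of Range header involved, a closed byte range and a suffix range, using the standard library. The helper names `build_range_request` and `fetch_range` are hypothetical, and the URL is a placeholder.

```python
import urllib.request


def build_range_request(url, start=None, end=None, suffix=None):
    """Build (but do not send) a GET request for part of a remote file.

    Either pass an inclusive byte range via start/end, or pass suffix=n
    to ask for the last n bytes of the file.
    """
    if suffix is not None:
        value = 'bytes=-%d' % suffix
    elif start is not None and end is not None:
        value = 'bytes=%d-%d' % (start, end)
    else:
        raise ValueError('pass start and end, or suffix')
    return urllib.request.Request(url, headers={'Range': value})


def fetch_range(req, timeout=30):
    """Send the request and insist on a 206 Partial Content answer."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != 206:
            # A 200 here means the server ignored Range and sent everything.
            raise IOError('expected 206, got %d' % resp.status)
        return resp.read()


# Placeholder URL; nothing is sent until fetch_range() is called.
header_req = build_range_request('https://example.invalid/corpus.storetle', start=0, end=8)
tail_req = build_range_request('https://example.invalid/corpus.storetle', suffix=64 * 1024)
```

Byte ranges are inclusive on both ends, so `start=0, end=8` asks for nine bytes.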
@@ -73,6 +73,53 @@ storetle to-warc archive.storetle out.warc.gz
  storetle train my_corpus/ --output my.bin # domain-specific dictionary
  ```

+ ## Hosted corpora — free
+
+ **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+ in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+ JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+
+ ```
+ https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+ ```
+
+ Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+
+ ```bash
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
+ ```
+
+ Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+ (arXiv, PubMed Central OA) coming.
+
+ ## Plain text extraction (v0.2.2)
+
+ `--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+ extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+ already separates structure from content, so text extraction is a walk over
+ the structure opcodes that keeps text nodes, drops script/style bodies, and
+ emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+ of readable text.
+
+ ## Remote archives (v0.2.1)
+
+ `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+ of Range requests; fetching a document downloads only its ~2MB chunk — no
+ server-side code, works against any Range-capable host (R2, S3, GitHub
+ Pages, nginx):
+
+ ```bash
+ storetle info https://adventurelands.github.io/storetle/sample.storetle
+ storetle get https://adventurelands.github.io/storetle/sample.storetle 4
+ ```
+
+ ```python
+ from storetle import RemoteReader
+ with RemoteReader('https://host/corpus.storetle') as r:
+     html = r[42] # one ~2MB range request
+ ```
+
  ## Python API

  ```python
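Reviewer note: the "Hosted corpora" section in this hunk mentions a SHA-256 manifest for the shards. The manifest's layout is not shown anywhere in this diff, so the sketch below covers only the generic half of a check: hashing a downloaded shard in blocks and comparing against a digest the caller has already read from the manifest. The function names are hypothetical and nothing here is part of storetle.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in fixed-size blocks so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA-256 with an expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in 1 MiB blocks keeps memory use flat even for the 100+ MB shards the README describes.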
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "storetle"
- version = "0.2.0"
+ version = "0.2.2"
  description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
  readme = "README.md"
  requires-python = ">=3.8"
@@ -8,10 +8,11 @@
  # benchmark — compare storetle vs gzip on your own data

  from .stream import StreamWriter, StreamReader
+ from .remote import RemoteReader
  from .folder import pack, unpack

- __version__ = '0.2.0'
- __all__ = ['StreamWriter', 'StreamReader', 'pack', 'unpack', 'benchmark']
+ __version__ = '0.2.2'
+ __all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']


  def benchmark(folder, quiet=False):
@@ -54,19 +54,35 @@ def cmd_pack(args):
      print(f'Output: {output}')


+ def _is_url(s):
+     return s.startswith('http://') or s.startswith('https://')
+
+
+ def _open_reader(src):
+     """Open a local path with StreamReader or a URL with RemoteReader."""
+     if _is_url(src):
+         from .remote import RemoteReader
+         return RemoteReader(src)
+     from .stream import StreamReader
+     return StreamReader(src)
+
+
  def cmd_unpack(args):
+     text = '--text' in args
+     args = [a for a in args if a != '--text']
      if len(args) < 2:
-         print('Usage: storetle unpack <file.storetle> <output_folder>')
+         print('Usage: storetle unpack <file-or-url> <output_folder> [--text]')
          sys.exit(1)
-     from .stream import StreamReader
      src = args[0]
      dst = Path(args[1])
      dst.mkdir(parents=True, exist_ok=True)

-     with StreamReader(src) as r:
-         print(f'Extracting {r.doc_count} documents to {dst}/')
-         for i, doc in enumerate(r):
-             out = dst / f'doc_{i:06d}.html'
+     ext = 'txt' if text else 'html'
+     with _open_reader(src) as r:
+         print(f'Extracting {r.doc_count} documents to {dst}/ as .{ext}')
+         docs = r.iter_text() if text else iter(r)
+         for i, doc in enumerate(docs):
+             out = dst / f'doc_{i:06d}.{ext}'
              out.write_bytes(doc)
              if (i + 1) % 100 == 0:
                  print(f' {i+1}/{r.doc_count}...')
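Reviewer note: this hunk introduces two small conventions, a prefix test that decides between local and remote readers, and a `--text` flag that is stripped out of the positional arguments wherever it appears. A standalone sketch of both follows, for illustration only; `is_url` and `split_flag` are hypothetical names, and the tuple form of `str.startswith` is an equivalent spelling rather than what the package uses.

```python
def is_url(s):
    """True for http:// and https:// sources (lowercase schemes only)."""
    # str.startswith accepts a tuple of candidate prefixes.
    return s.startswith(('http://', 'https://'))


def split_flag(args, flag='--text'):
    """Return (flag_present, positional_args) with every copy of flag removed."""
    return flag in args, [a for a in args if a != flag]


present, positional = split_flag(['--text', 'archive.storetle', 'out/'])
```

Because the flag is filtered before the length check, `--text` can sit before, between, or after the positional arguments. One consequence of the prefix test is that an uppercase scheme such as `HTTP://` would be treated as a local path.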
@@ -75,15 +91,20 @@ def cmd_unpack(args):

  def cmd_info(args):
      if not args:
-         print('Usage: storetle info <file.storetle>')
+         print('Usage: storetle info <file-or-url>')
          sys.exit(1)
-     from .stream import StreamReader

      def fmt(n):
          if n < 1048576: return f'{n/1024:.1f}KB'
          return f'{n/1048576:.2f}MB'

-     info = StreamReader.info(args[0])
+     if _is_url(args[0]):
+         from .remote import RemoteReader
+         with RemoteReader(args[0]) as r:
+             info = r.info()
+     else:
+         from .stream import StreamReader
+         info = StreamReader.info(args[0])
      print(f'\n {args[0]}')
      print(f' Documents: {info["docs"]:,}')
      print(f' Chunks: {info["chunks"]:,}')
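Reviewer note: `cmd_info` formats sizes with the nested `fmt` helper, and for URLs it prints the dictionary built by `RemoteReader.info()`, whose `ratio_pct` field is computed in remote.py as `round(100 * (1 - comp / max(1, original)), 2)`. The sketch below reproduces both calculations in isolation so the arithmetic is easy to check; `fmt_size` and `ratio_pct` are hypothetical standalone names.

```python
def fmt_size(n):
    """Format a byte count the way cmd_info's fmt() does: KB below 1 MiB, else MB."""
    if n < 1048576:
        return f'{n / 1024:.1f}KB'
    return f'{n / 1048576:.2f}MB'


def ratio_pct(original_bytes, compressed_bytes):
    """Percentage saved, guarding against division by zero like remote.py's info()."""
    return round(100 * (1 - compressed_bytes / max(1, original_bytes)), 2)


small = fmt_size(2048)
large = fmt_size(1048576)
saved = ratio_pct(1000, 250)
```

Note that the `max(1, ...)` guard means an empty archive (zero original and zero compressed bytes) reports a ratio of 100.0 rather than raising.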
@@ -93,15 +114,18 @@ def cmd_info(args):


  def cmd_get(args):
+     text = '--text' in args
+     args = [a for a in args if a != '--text']
      if len(args) < 2:
-         print('Usage: storetle get <file.storetle> <index>')
+         print('Usage: storetle get <file-or-url> <index> [--text]')
          sys.exit(1)
-     from .stream import StreamReader
-     with StreamReader(args[0]) as r:
+     with _open_reader(args[0]) as r:
          try:
              idx = int(args[1])
-             doc = r[idx]
+             doc = r.get_text(idx) if text else r[idx]
              sys.stdout.buffer.write(doc)
+             if text:
+                 sys.stdout.buffer.write(b'\n')
          except (IndexError, ValueError) as e:
              print(f'Error: {e}')
              sys.exit(1)
@@ -254,9 +278,11 @@ HELP = """storetle — HTML-aware compression for large document collections
  Commands:
  bench <folder> Benchmark your HTML data vs gzip WARC
  pack <folder> <output> Compress a folder → .storetle file
- unpack <file> <output_folder> Extract a .storetle → HTML files
- info <file> Show file statistics
- get <file> <index> Extract one document by index (0-based)
+ unpack <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
+ info <file-or-url> Show file statistics
+ get <file-or-url> <index> Extract one doc by index — over HTTP this
+ fetches only the containing ~2MB chunk.
+ Add --text for tag-stripped plain text
  from-warc <input.warc[.gz]> <out> Convert WARC → .storetle
  to-warc <input.storetle> <out> Convert .storetle → WARC (or .warc.gz)
  warc-encode <input.warc[.gz]> <out> Encode HTML in-place → valid .warc.gz (smaller, standard format)
@@ -0,0 +1,190 @@
+ # remote.py — read .storetle archives over HTTP(S) with Range requests.
+ #
+ # Opens an archive with at most three small requests (footer+index tail,
+ # header, dictionary if embedded), then fetches exactly one chunk span
+ # (≤ ~2 MB compressed) per document access. Works against any server or
+ # object store that honors Range (S3, R2, GitHub Pages, nginx, ...).
+ #
+ # Stdlib only — urllib, struct.
+
+ import struct
+ import urllib.request
+ from pathlib import Path
+
+ from .stream import STREAM_MAGIC, STREAM_VERSION, _decompress, _decode_doc
+
+ _DEFAULT_DICT_PATH = Path(__file__).parent / 'cube_dict_v10.bin'
+
+ # One speculative tail fetch usually captures index + footer in a single
+ # round trip (index entries are 14 bytes; 64 KB covers ~4,600 chunks ≈
+ # 1.2M documents).
+ _TAIL_BYTES = 64 * 1024
+
+
+ class RemoteReader:
+     """Random-access reader for a .storetle file served over HTTP(S).
+
+     Usage:
+         with RemoteReader('https://host/corpus.storetle') as r:
+             print(r.doc_count)
+             html = r[42]
+             for doc in r:
+                 ...
+     """
+
+     def __init__(self, url, dictionary=None, timeout=30):
+         self._url = url
+         self._timeout = timeout
+         self._chunk_cache = (None, None) # (chunk_idx, [decoded raw docs])
+         self.bytes_fetched = 0
+
+         tail = self._fetch_suffix(_TAIL_BYTES)
+         if len(tail) < 16:
+             raise ValueError('File too small to be a .storetle archive')
+         chunk_count, index_offset = struct.unpack('>QQ', tail[-16:])
+
+         index_size = chunk_count * 14
+         if index_size + 16 <= len(tail):
+             index_raw = tail[-(index_size + 16):-16]
+         else:
+             index_raw = self._fetch(index_offset, index_offset + index_size - 1)
+
+         self._index = []
+         for i in range(chunk_count):
+             off, dc, orig = struct.unpack_from('>QHI', index_raw, i * 14)
+             self._index.append((off, dc, orig))
+
+         # chunk i occupies [offset_i, offset_{i+1}); the last ends at the index
+         self._chunk_ends = [self._index[i + 1][0] for i in range(chunk_count - 1)]
+         self._chunk_ends.append(index_offset)
+
+         # cumulative doc counts for index lookup
+         self._cum = [0]
+         for _, dc, _ in self._index:
+             self._cum.append(self._cum[-1] + dc)
+         self.doc_count = self._cum[-1]
+
+         head = self._fetch(0, 8)
+         if head[:4] != STREAM_MAGIC:
+             raise ValueError('Not a .storetle file (magic: %r)' % head[:4])
+         if head[4] != STREAM_VERSION:
+             raise ValueError('Unsupported version %d (reader is v%d)'
+                              % (head[4], STREAM_VERSION))
+         dict_size = struct.unpack('>I', head[5:9])[0]
+
+         if dictionary is not None:
+             self._dict = dictionary
+         elif dict_size:
+             self._dict = self._fetch(9, 9 + dict_size - 1)
+         else:
+             self._dict = _DEFAULT_DICT_PATH.read_bytes() \
+                 if _DEFAULT_DICT_PATH.exists() else b''
+
+     # -- HTTP plumbing ------------------------------------------------------
+
+     def _fetch(self, start, end):
+         return self._range_request('bytes=%d-%d' % (start, end))
+
+     def _fetch_suffix(self, n):
+         return self._range_request('bytes=-%d' % n)
+
+     def _range_request(self, range_header):
+         req = urllib.request.Request(self._url, headers={
+             'Range': range_header,
+             'User-Agent': 'storetle-remote/0.2.1',
+         })
+         with urllib.request.urlopen(req, timeout=self._timeout) as resp:
+             if resp.status not in (200, 206):
+                 raise IOError('HTTP %d for %s' % (resp.status, self._url))
+             if resp.status == 200 and range_header != 'bytes=0-':
+                 raise IOError(
+                     'Server ignored Range request — remote access needs a '
+                     'server that supports HTTP Range (got full response)')
+             data = resp.read()
+         self.bytes_fetched += len(data)
+         return data
+
+     # -- document access ----------------------------------------------------
+
+     def _load_chunk(self, ci):
+         if self._chunk_cache[0] == ci:
+             return self._chunk_cache[1]
+         off, expect_dc, _ = self._index[ci]
+         raw = self._fetch(off, self._chunk_ends[ci] - 1)
+         dc, _orig, comp_size = struct.unpack_from('>HII', raw, 0)
+         if dc != expect_dc:
+             raise ValueError('Chunk %d header disagrees with index' % ci)
+         sizes = struct.unpack_from('>%dI' % dc, raw, 10)
+         blob = _decompress(raw[10 + dc * 4: 10 + dc * 4 + comp_size], self._dict)
+         docs, pos = [], 0
+         for s in sizes:
+             docs.append(blob[pos:pos + s])
+             pos += s
+         self._chunk_cache = (ci, docs)
+         return docs
+
+     def _locate(self, idx):
+         lo, hi = 0, len(self._index) - 1
+         while lo < hi:
+             mid = (lo + hi) // 2
+             if self._cum[mid + 1] <= idx:
+                 lo = mid + 1
+             else:
+                 hi = mid
+         return lo
+
+     def __len__(self):
+         return self.doc_count
+
+     def __getitem__(self, idx):
+         if isinstance(idx, slice):
+             return [self[i] for i in range(*idx.indices(self.doc_count))]
+         if idx < 0:
+             idx += self.doc_count
+         if not 0 <= idx < self.doc_count:
+             raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+         ci = self._locate(idx)
+         docs = self._load_chunk(ci)
+         return _decode_doc(docs[idx - self._cum[ci]])
+
+     def __iter__(self):
+         for ci in range(len(self._index)):
+             for raw in self._load_chunk(ci):
+                 yield _decode_doc(raw)
+
+     def get_text(self, idx):
+         """Return extracted plain text (no tags) for a single document."""
+         from .text import decode_text
+         if idx < 0:
+             idx += self.doc_count
+         if not 0 <= idx < self.doc_count:
+             raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+         ci = self._locate(idx)
+         return decode_text(self._load_chunk(ci)[idx - self._cum[ci]])
+
+     def iter_text(self):
+         """Yield extracted plain text for every document, in order."""
+         from .text import decode_text
+         for ci in range(len(self._index)):
+             for raw in self._load_chunk(ci):
+                 yield decode_text(raw)
+
+     def info(self):
+         comp = self._chunk_ends[-1] - self._index[0][0] if self._index else 0
+         return {
+             'docs': self.doc_count,
+             'chunks': len(self._index),
+             'original_bytes': sum(orig for _, _, orig in self._index),
+             'compressed_bytes': comp,
+             'ratio_pct': round(100 * (1 - comp / max(1, sum(o for _, _, o in self._index))), 2),
+         }
+
+     def close(self):
+         self._chunk_cache = (None, None)
+
+     def __enter__(self):
+         return self
+
+     def __exit__(self, *exc):
+         self.close()
+         return False
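Reviewer note: the constructor above reads a 16-byte footer (`>QQ`: chunk count, index offset) and 14-byte index entries (`>QHI`: chunk offset, document count, original size), then `_locate` binary-searches the cumulative document counts. The sketch below exercises that layout on a made-up in-memory tail, with `bisect` from the standard library standing in for the hand-written search. The sample entries and the function names are hypothetical; only the two struct formats come from the diff.

```python
import bisect
import struct

ENTRY = struct.Struct('>QHI')   # chunk offset, doc count, original size (14 bytes)
FOOTER = struct.Struct('>QQ')   # chunk count, index offset (16 bytes)


def parse_tail(tail):
    """Parse the index entries and footer from the last bytes of an archive."""
    chunk_count, index_offset = FOOTER.unpack(tail[-FOOTER.size:])
    index_size = chunk_count * ENTRY.size
    if index_size + FOOTER.size > len(tail):
        raise ValueError('tail too short; the index needs its own fetch')
    index_raw = tail[-(index_size + FOOTER.size):-FOOTER.size]
    entries = [ENTRY.unpack_from(index_raw, i * ENTRY.size)
               for i in range(chunk_count)]
    return entries, index_offset


def cumulative_counts(entries):
    """Running total of documents: cum[i] is the first doc index in chunk i."""
    cum = [0]
    for _offset, doc_count, _orig in entries:
        cum.append(cum[-1] + doc_count)
    return cum


def locate(cum, idx):
    """Map a document index to (chunk index, position inside that chunk)."""
    if not 0 <= idx < cum[-1]:
        raise IndexError(idx)
    ci = bisect.bisect_right(cum, idx) - 1
    return ci, idx - cum[ci]


# Made-up three-chunk archive: (offset, doc count, original size) per chunk.
sample = [(9, 3, 1000), (500, 2, 800), (900, 4, 1200)]
tail = b''.join(ENTRY.pack(*e) for e in sample) + FOOTER.pack(len(sample), 1400)
entries, index_offset = parse_tail(tail)
cum = cumulative_counts(entries)
```

With nine documents spread 3/2/4 across the chunks, `cum` is `[0, 3, 5, 9]`, so document 3 is the first entry of chunk 1 and document 8 is the last entry of chunk 2.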
@@ -426,6 +426,19 @@ class StreamReader:
          raw = self._read_chunk(ci)[wi]
          return _decode_doc(raw)

+     def get_text(self, doc_idx: int):
+         """Return extracted plain text (no tags) for a single document."""
+         from .text import decode_text
+         ci, wi = self._locate(doc_idx)
+         return decode_text(self._read_chunk(ci)[wi])
+
+     def iter_text(self):
+         """Yield extracted plain text for every document, in order."""
+         from .text import decode_text
+         for ci in range(len(self._index)):
+             for raw in self._read_chunk(ci):
+                 yield decode_text(raw)
+
      def __getitem__(self, key):
          if isinstance(key, int):
              return self.get(key)
@@ -0,0 +1,78 @@
+ # text.py — plain-text extraction straight from the NodeOp encoding.
+ #
+ # The encoded form already separates structure (struct stream) from content
+ # (content stream), so producing clean text never re-parses HTML: walk the
+ # opcodes, keep T_TEXT payloads, skip script/style bodies (T_RAWTEXT),
+ # comments and doctypes, and emit newlines at block-element boundaries.
+ #
+ # This consumes the content stream in exact lockstep with stream._decode_doc —
+ # every string the HTML decoder would read, this reads too, it just throws
+ # most of them away.
+
+ import re
+ import struct
+
+ from .decoder import _Stream
+ from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
+                       T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
+ from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+
+ _BLOCK_TAGS = frozenset((
+     'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
+     'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+     'table', 'tr', 'caption', 'thead', 'tbody',
+     'section', 'article', 'aside', 'header', 'footer', 'main', 'nav',
+     'blockquote', 'pre', 'figure', 'figcaption', 'hr', 'title',
+ ))
+ _CELL_TAGS = frozenset(('td', 'th'))
+
+ _collapse_blank = re.compile(r'\n\s*\n+')
+ _collapse_space = re.compile(r'[ \t\f\v]+')
+
+
+ def decode_text(raw: bytes) -> bytes:
+     """Extract plain text from one encoded document (the blob stored in a
+     chunk), without reconstructing HTML."""
+     ss_len = struct.unpack_from('>I', raw, 0)[0]
+     ss = _Stream(raw[4: 4 + ss_len])
+     cs = _Stream(raw[4 + ss_len:])
+     ss_data_len = ss_len
+
+     out = []
+
+     def boundary(tag):
+         if tag in _BLOCK_TAGS:
+             out.append('\n')
+         elif tag in _CELL_TAGS:
+             out.append('\t')
+
+     while ss._pos < ss_data_len:
+         nt = ss.read_byte()
+
+         if nt in (T_OPEN, T_SELFCLOSE):
+             tag_id = ss.read_byte()
+             tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
+                 else ID_TO_TAG.get(tag_id, '')
+             ac = ss.read_byte()
+             for _ in range(ac):
+                 aid = ss.read_byte()
+                 if aid == UNKNOWN_ID:
+                     cs.read_string(SHARED_STRINGS) # attr name — discard
+                 cs.read_string(SHARED_STRINGS) # attr value — discard
+             boundary(tag)
+
+         elif nt == T_CLOSE:
+             pass # no payload; block boundary handled at open
+
+         elif nt == T_TEXT:
+             t = cs.read_string(SHARED_STRINGS)
+             if t:
+                 out.append(t)
+
+         elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
+             cs.read_string(SHARED_STRINGS) # script/style/meta — discard
+
+     text = ''.join(out)
+     text = _collapse_space.sub(' ', text)
+     text = _collapse_blank.sub('\n', text)
+     return text.strip().encode('utf-8')
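Reviewer note: the last three statements of `decode_text` normalize whitespace with the two module-level patterns. The sketch below isolates that step so its effect can be checked without an encoded document; the patterns are copied from the hunk, while the `normalize` wrapper and the sample strings are hypothetical.

```python
import re

# Same patterns as text.py: runs of blank lines, and runs of horizontal whitespace.
_collapse_blank = re.compile(r'\n\s*\n+')
_collapse_space = re.compile(r'[ \t\f\v]+')


def normalize(text):
    """Collapse horizontal whitespace to one space, then blank-line runs to one newline."""
    text = _collapse_space.sub(' ', text)
    text = _collapse_blank.sub('\n', text)
    return text.strip()


paragraphs = normalize('A  B\t\tC\n\n\nD')
cells = normalize('name\tvalue')
```

One thing worth checking with the author: `_collapse_space` includes `\t` in its character class, so the tab that `boundary()` appends for `td`/`th` cells ends up as a single space in the final output rather than a tab.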
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: storetle
- Version: 0.2.0
+ Version: 0.2.2
  Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
  Author-email: Davis Brief <davis@team8.co>
  License: MIT
@@ -96,6 +96,53 @@ storetle to-warc archive.storetle out.warc.gz
  storetle train my_corpus/ --output my.bin # domain-specific dictionary
  ```

+ ## Hosted corpora — free
+
+ **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+ in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+ JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+
+ ```
+ https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+ ```
+
+ Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+
+ ```bash
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
+ storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
+ ```
+
+ Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+ (arXiv, PubMed Central OA) coming.
+
+ ## Plain text extraction (v0.2.2)
+
+ `--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+ extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+ already separates structure from content, so text extraction is a walk over
+ the structure opcodes that keeps text nodes, drops script/style bodies, and
+ emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+ of readable text.
+
+ ## Remote archives (v0.2.1)
+
+ `get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+ of Range requests; fetching a document downloads only its ~2MB chunk — no
+ server-side code, works against any Range-capable host (R2, S3, GitHub
+ Pages, nginx):
+
+ ```bash
+ storetle info https://adventurelands.github.io/storetle/sample.storetle
+ storetle get https://adventurelands.github.io/storetle/sample.storetle 4
+ ```
+
+ ```python
+ from storetle import RemoteReader
+ with RemoteReader('https://host/corpus.storetle') as r:
+     html = r[42] # one ~2MB range request
+ ```
+
  ## Python API

  ```python
@@ -8,7 +8,9 @@ storetle/cube_dict_v10.bin
  storetle/decoder.py
  storetle/encoder.py
  storetle/folder.py
+ storetle/remote.py
  storetle/stream.py
+ storetle/text.py
  storetle/vocab.py
  storetle/warc.py
  storetle/zstd_compat.py
File without changes