storetle 0.2.2__tar.gz → 0.3.1__tar.gz
- {storetle-0.2.2/storetle.egg-info → storetle-0.3.1}/PKG-INFO +22 -16
- {storetle-0.2.2 → storetle-0.3.1}/README.md +21 -15
- {storetle-0.2.2 → storetle-0.3.1}/pyproject.toml +1 -1
- {storetle-0.2.2 → storetle-0.3.1}/storetle/__init__.py +1 -1
- {storetle-0.2.2 → storetle-0.3.1}/storetle/cli.py +30 -4
- storetle-0.3.1/storetle/registry.py +104 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/stream.py +29 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/text.py +37 -8
- {storetle-0.2.2 → storetle-0.3.1/storetle.egg-info}/PKG-INFO +22 -16
- {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/SOURCES.txt +1 -0
- {storetle-0.2.2 → storetle-0.3.1}/LICENSE +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/setup.cfg +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/brotli_compat.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/cube_dict_v10.bin +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/decoder.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/encoder.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/folder.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/remote.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/vocab.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/warc.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle/zstd_compat.py +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/dependency_links.txt +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/entry_points.txt +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/requires.txt +0 -0
- {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.
+Version: 0.3.1
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -98,23 +98,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary

 ## Hosted corpora — free

-
-
-
-
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora # list what's available
+storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
+storetle get wiki-text "Black hole" # from the clean-text edition
 ```

-
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.

-
-
-
-
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |

-
-
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.

 ## Plain text extraction (v0.2.2)

@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):

 ```bash
-storetle info https://
-storetle get
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get wiki "Albert Einstein" --text
 ```

 ```python
@@ -75,23 +75,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary

 ## Hosted corpora — free

-
-
-
-
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora # list what's available
+storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
+storetle get wiki-text "Black hole" # from the clean-text edition
 ```

-
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.

-
-
-
-
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |

-
-
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.

 ## Plain text extraction (v0.2.2)

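Note on the README hunk above: it describes any article as "one ~2 MB range request away". The following is a minimal sketch of what such a request looks like with the standard library, assuming the host honors HTTP `Range` headers. The helper names `build_range_request` and `fetch_range`, the `User-Agent` value, and any offsets passed in are hypothetical; how storetle itself locates chunk offsets is not shown in this diff.

```python
import urllib.request


def build_range_request(url, start, length):
    """Build a GET request for bytes [start, start + length) of url."""
    if start < 0 or length <= 0:
        raise ValueError('start must be >= 0 and length must be > 0')
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(
        url,
        headers={'Range': f'bytes={start}-{end}', 'User-Agent': 'range-sketch'},
    )


def fetch_range(url, start, length, timeout=30):
    """Fetch one byte range; status 206 means the server honored the Range."""
    req = build_range_request(url, start, length)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status != 206:
            raise RuntimeError(f'server ignored Range (status {resp.status})')
        return resp.read()
```

Checking for status 206 matters because a host without Range support may answer 200 with the whole file, which defeats the point of a partial read.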
@@ -110,8 +116,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):

 ```bash
-storetle info https://
-storetle get
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get wiki "Albert Einstein" --text
 ```

 ```python
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "storetle"
-version = "0.
+version = "0.3.1"
 description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -11,7 +11,7 @@ from .stream import StreamWriter, StreamReader
 from .remote import RemoteReader
 from .folder import pack, unpack

-__version__ = '0.
+__version__ = '0.3.1'
 __all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']


@@ -117,11 +117,23 @@ def cmd_get(args):
     text = '--text' in args
     args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle get <file
+        print('Usage: storetle get <file|url|corpus> <index|title> [--text]\n'
+              ' storetle get wiki "Albert Einstein" --text')
         sys.exit(1)
-
+
+    src, ref = args[0], ' '.join(args[1:])
+    if not _is_url(src) and not Path(src).exists():
+        # treat as a named corpus from the public registry
+        from .registry import resolve
+        try:
+            src, ref = resolve(src, ref)
+        except (KeyError, IndexError) as e:
+            print(f'Error: {e}')
+            sys.exit(1)
+
+    with _open_reader(src) as r:
         try:
-            idx = int(
+            idx = int(ref)
             doc = r.get_text(idx) if text else r[idx]
             sys.stdout.buffer.write(doc)
             if text:
@@ -131,6 +143,17 @@ def cmd_get(args):
             sys.exit(1)


+def cmd_corpora(args):
+    from .registry import list_corpora
+    print()
+    for name, info in list_corpora().items():
+        print(f' {name:12s} {info.get("title","")} '
+              f'[{info.get("docs","?"):,} docs, {info.get("snapshot","")}, '
+              f'{info.get("license","")}]')
+    print('\n Usage: storetle get <corpus> <title-or-index> [--text]')
+    print()
+
+
 def cmd_from_warc(args):
     if len(args) < 2:
         print('Usage: storetle from-warc <input.warc[.gz]> <output.storetle>')
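Note on `cmd_corpora` above: `info.get("docs","?")` is formatted with the `:,` spec. The thousands separator is only defined for numeric values, so if a registry entry ever lacked `docs`, the `"?"` fallback would raise `ValueError` rather than print a placeholder. A small sketch of one possible guard follows; the helper name `format_docs` is hypothetical and not part of the package.

```python
def format_docs(value):
    """Render a document count with thousands separators, or '?' if unknown."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return '?'
    return f'{value:,}'
```

With a helper like this, the listing line could interpolate `format_docs(info.get("docs"))` and stay robust against incomplete registry entries.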
@@ -262,6 +285,7 @@ def cmd_warc_decode(args):

 COMMANDS = {
     'bench': cmd_bench,
+    'corpora': cmd_corpora,
     'pack': cmd_pack,
     'unpack': cmd_unpack,
     'info': cmd_info,
@@ -276,11 +300,13 @@ COMMANDS = {
 HELP = """storetle — HTML-aware compression for large document collections

 Commands:
+corpora List free hosted corpora
 bench <folder> Benchmark your HTML data vs gzip WARC
 pack <folder> <output> Compress a folder → .storetle file
 unpack <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
 info <file-or-url> Show file statistics
-get <file
+get <file|url|corpus> <ref> Extract one doc by index or title — remote
+reads fetch only the containing ~2MB chunk.
 fetches only the containing ~2MB chunk.
 Add --text for tag-stripped plain text
 from-warc <input.warc[.gz]> <out> Convert WARC → .storetle
@@ -0,0 +1,104 @@
+# registry.py — named corpora: `storetle get wiki "Albert Einstein"`.
+#
+# A corpus registry (corpora.json) lives next to the hosted data, so new
+# corpora appear without a new package release. Title→location maps are
+# fetched once and cached under ~/.cache/storetle/.
+
+import gzip
+import json
+import time
+import urllib.request
+from pathlib import Path
+
+REGISTRY_URL = 'https://data.davisbrief.com/corpora.json'
+CACHE_DIR = Path.home() / '.cache' / 'storetle'
+_REGISTRY_TTL = 3600
+
+
+def _fetch(url, timeout=30):
+    req = urllib.request.Request(url, headers={'User-Agent': 'storetle-cli'})
+    with urllib.request.urlopen(req, timeout=timeout) as r:
+        return r.read()
+
+
+def load_registry():
+    """Fetch corpora.json, with a small on-disk cache."""
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    cache = CACHE_DIR / 'corpora.json'
+    if cache.exists() and time.time() - cache.stat().st_mtime < _REGISTRY_TTL:
+        return json.loads(cache.read_text())
+    try:
+        data = _fetch(REGISTRY_URL)
+        cache.write_bytes(data)
+        return json.loads(data)
+    except Exception:
+        if cache.exists():  # stale beats broken
+            return json.loads(cache.read_text())
+        raise
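Note on `load_registry` above: it is a time-to-live cache with a "stale beats broken" fallback. The sketch below isolates that pattern so it can be exercised without a network. The names `load_cached` and `demo` and the injected `fetch` callable are hypothetical. One deliberate difference from the diff: this sketch parses the payload before writing it, on the assumption that a malformed response should not overwrite a good cached copy.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def load_cached(cache, fetch, ttl=3600):
    """Return parsed JSON from cache if fresh, else refetch; fall back to stale."""
    if cache.exists() and time.time() - cache.stat().st_mtime < ttl:
        return json.loads(cache.read_text(encoding='utf-8'))
    try:
        data = fetch()
        parsed = json.loads(data)  # parse first: bad payloads never reach disk
        cache.write_bytes(data)
        return parsed
    except Exception:
        if cache.exists():  # serve the stale copy rather than fail
            return json.loads(cache.read_text(encoding='utf-8'))
        raise


def demo():
    """Populate the cache, expire it, then fail the fetch and read the stale copy."""
    with tempfile.TemporaryDirectory() as d:
        cache = Path(d) / 'corpora.json'
        first = load_cached(cache, lambda: b'{"wiki": {"docs": 1}}')

        def broken():
            raise OSError('network down')

        old = time.time() - 7200  # push mtime two hours back, past the TTL
        os.utime(cache, (old, old))
        second = load_cached(cache, broken)
        return first, second
```

Calling `demo()` returns the same registry dict twice: once from the simulated fetch and once from the expired cache after the fetch fails.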
+
+
+def _titles_path(corpus_name, entry):
+    """Download (once) and cache the corpus title map."""
+    CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    local = CACHE_DIR / f'titles-{corpus_name}.tsv.gz'
+    if not local.exists():
+        url = entry['base'].rstrip('/') + '/' + entry['titles']
+        print(f'[storetle] fetching title index for "{corpus_name}" '
+              f'({url.rsplit("/",1)[-1]}, one-time)...')
+        local.write_bytes(_fetch(url))
+    return local
+
+
+def _lookup_title(corpus_name, entry, title):
+    """Resolve a title to (shard_no, doc_idx). Exact, then case-insensitive."""
+    want = title.strip()
+    want_ci = want.lower()
+    ci_hit = None
+    with gzip.open(_titles_path(corpus_name, entry), 'rt') as f:
+        for line in f:
+            name, shard, idx = line.rstrip('\n').rsplit('\t', 2)
+            if name == want:
+                return int(shard), int(idx)
+            if ci_hit is None and name.lower() == want_ci:
+                ci_hit = (int(shard), int(idx))
+    if ci_hit:
+        return ci_hit
+    raise KeyError(f'title not found in corpus "{corpus_name}": {title!r}')
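Note on `_lookup_title` above: the scan prefers an exact match anywhere in the file over an earlier case-insensitive match, and only falls back to the first case-insensitive hit at the end. A self-contained sketch on an in-memory index follows. The `title<TAB>shard<TAB>index` row layout is inferred from the `rsplit('\t', 2)` call; the function name `lookup_title` and the `SAMPLE` rows are made up for illustration. The sketch passes `encoding='utf-8'` explicitly, which the diff leaves to the platform default.

```python
import gzip
import io


def lookup_title(tsv_gz_bytes, title):
    """Return (shard_no, doc_idx): exact match first, else first case-insensitive."""
    want = title.strip()
    want_ci = want.lower()
    ci_hit = None
    with gzip.open(io.BytesIO(tsv_gz_bytes), 'rt', encoding='utf-8') as f:
        for line in f:
            # rsplit from the right so a tab inside a title cannot shift columns
            name, shard, idx = line.rstrip('\n').rsplit('\t', 2)
            if name == want:
                return int(shard), int(idx)
            if ci_hit is None and name.lower() == want_ci:
                ci_hit = (int(shard), int(idx))
    if ci_hit is not None:
        return ci_hit
    raise KeyError(title)


# Hypothetical index: two titles differ only by case.
SAMPLE = gzip.compress(
    'black hole\t0\t7\nBlack hole\t1\t42\nAlbert Einstein\t0\t3\n'.encode('utf-8')
)
```

Because the whole file is scanned linearly on each call, lookup cost grows with index size; that appears to be an accepted trade-off for keeping the index a plain gzip TSV.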
+
+
+def resolve(corpus_name, ref):
+    """Resolve (corpus, index-or-title) → (shard_url, doc_idx).
+
+    ref may be an integer global index or a document title.
+    """
+    reg = load_registry()
+    if corpus_name not in reg:
+        raise KeyError(f'unknown corpus {corpus_name!r}; '
+                       f'available: {", ".join(sorted(reg))}')
+    entry = reg[corpus_name]
+    base = entry['base'].rstrip('/')
+    shards = entry['shards']
+
+    try:
+        gidx = int(ref)
+    except (TypeError, ValueError):
+        shard_no, idx = _lookup_title(corpus_name, entry, ref)
+        return f'{base}/{shards[shard_no]}', idx
+
+    # integer: map global index onto shards via per-shard doc counts
+    counts = entry.get('shard_docs') or []
+    if not counts:
+        return f'{base}/{shards[0]}', gidx
+    run = 0
+    for shard_no, n in enumerate(counts):
+        if gidx < run + n:
+            return f'{base}/{shards[shard_no]}', gidx - run
+        run += n
+    raise IndexError(f'index {gidx} out of range ({run} docs in corpus)')
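Note on the integer branch of `resolve` above: a global document index is mapped to a shard by walking cumulative per-shard counts. The sketch below isolates that arithmetic. The function name `locate` and the counts `[100, 250, 50]` are hypothetical. The negative-index check is an addition of this sketch; in the diff as shown, a negative `ref` would fall into the first shard with a negative local index.

```python
def locate(counts, gidx):
    """Map a global doc index to (shard_no, local_idx) given per-shard doc counts."""
    if gidx < 0:
        raise IndexError(f'index {gidx} is negative')
    run = 0  # number of docs in all shards before the current one
    for shard_no, n in enumerate(counts):
        if gidx < run + n:
            return shard_no, gidx - run
        run += n
    raise IndexError(f'index {gidx} out of range ({run} docs in corpus)')
```

For counts `[100, 250, 50]`, global index 100 is the first document of shard 1 and index 399 is the last document of shard 2. Also worth knowing: because `int(ref)` is tried first, a purely numeric title such as "1984" is treated as a global index rather than looked up by name.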
+
+
+def list_corpora():
+    reg = load_registry()
+    return {name: {k: v for k, v in e.items() if k in
+                   ('title', 'docs', 'snapshot', 'license')}
+            for name, e in sorted(reg.items())}
@@ -84,6 +84,18 @@ def _encode_doc(html):
     return struct.pack('>I', len(ss)) + ss + cs


+def _encode_text_doc(text):
+    """Encode a plain-text document as a single text node.
+
+    Same container, same decoders: HTML readers see the text (escaped),
+    --text / get_text() return it verbatim. This is how text-mode corpora
+    (e.g. clean-text Wikipedia) are stored without any format change."""
+    if isinstance(text, bytes):
+        text = text.decode('utf-8', errors='replace')
+    ss, cs = _build_streams_class_split([(T_TEXT, None, text)])
+    return struct.pack('>I', len(ss)) + ss + cs
+
+
 def _decode_doc(raw):
     """Decode a v2 blob back to reconstructed HTML bytes."""
     from .decoder import _Stream
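Note on the blob layout in this hunk: both `_encode_doc` and the new `_encode_text_doc` return `struct.pack('>I', len(ss)) + ss + cs`, that is, a 4-byte big-endian length, then the `ss` stream, then the `cs` stream. The sketch below shows that framing and how a reader could split it back. The names `frame` and `unframe` are hypothetical, and the contents of the two streams are not covered here, since `_build_streams_class_split` is not part of this diff.

```python
import struct


def frame(ss: bytes, cs: bytes) -> bytes:
    """Prefix ss with its length as a big-endian uint32, then append cs."""
    return struct.pack('>I', len(ss)) + ss + cs


def unframe(blob: bytes):
    """Split a framed blob back into (ss, cs)."""
    if len(blob) < 4:
        raise ValueError('blob shorter than the 4-byte length prefix')
    (ss_len,) = struct.unpack_from('>I', blob, 0)
    if 4 + ss_len > len(blob):
        raise ValueError('length prefix exceeds blob size')
    return blob[4:4 + ss_len], blob[4 + ss_len:]
```

Only the first stream needs a length prefix, because the second one simply runs to the end of the blob.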
@@ -229,6 +241,23 @@ class StreamWriter:
         else:
             self._append_sync(html)

+    def append_text(self, text):
+        """Encode and buffer one plain-text document (no HTML parsing)."""
+        if self._workers > 1:
+            # preserve document order: settle in-flight HTML encodes first
+            self._drain_all()
+        if isinstance(text, str):
+            data = text.encode('utf-8', errors='replace')
+        else:
+            data = text
+        self._total_orig += len(data)
+        raw = _encode_text_doc(data)
+        self._chunk_buf.append(raw)
+        self._chunk_bytes += len(raw)
+        self._total_docs += 1
+        if len(self._chunk_buf) >= CHUNK_DOCS or self._chunk_bytes >= CHUNK_BYTES:
+            self._flush_chunk()
+
     def _append_sync(self, html):
         if isinstance(html, str):
             try:
@@ -15,7 +15,7 @@ import struct
 from .decoder import _Stream
 from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
                       T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
-from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+from .vocab import ID_TO_TAG, ID_TO_ATTR, SHARED_STRINGS, UNKNOWN_ID

 _BLOCK_TAGS = frozenset((
     'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
@@ -29,6 +29,25 @@ _CELL_TAGS = frozenset(('td', 'th'))
 _collapse_blank = re.compile(r'\n\s*\n+')
 _collapse_space = re.compile(r'[ \t\f\v]+')

+# Elements whose entire subtree is navigation/boilerplate, not content.
+# Matched against class tokens (substring) and role attribute values.
+_SKIP_CLASS_SUBSTR = ('navbox', 'catlinks', 'mw-jump', 'printfooter',
+                      'mw-editsection', 'breadcrumb', 'site-nav')
+_SKIP_ROLES = ('navigation',)
+
+
+def _is_boilerplate(attrs):
+    for name, value in attrs:
+        if not value:
+            continue
+        if name == 'role' and value.lower() in _SKIP_ROLES:
+            return True
+        if name == 'class':
+            v = value.lower()
+            if any(s in v for s in _SKIP_CLASS_SUBSTR):
+                return True
+    return False
+

 def decode_text(raw: bytes) -> bytes:
     """Extract plain text from one encoded document (the blob stored in a
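Note on `_is_boilerplate` above: class matching is by substring over the whole lowercased `class` value, not by whole token, while `role` must equal a listed value. The self-contained copy below, with the leading underscores dropped, makes that behavior easy to probe. The sample attribute lists in the usage note are invented.

```python
SKIP_CLASS_SUBSTR = ('navbox', 'catlinks', 'mw-jump', 'printfooter',
                     'mw-editsection', 'breadcrumb', 'site-nav')
SKIP_ROLES = ('navigation',)


def is_boilerplate(attrs):
    """True if any (name, value) pair marks the element as navigation chrome."""
    for name, value in attrs:
        if not value:
            continue
        if name == 'role' and value.lower() in SKIP_ROLES:
            return True
        if name == 'class' and any(s in value.lower() for s in SKIP_CLASS_SUBSTR):
            return True
    return False
```

`[('class', 'navbox-inner vertical')]` and `[('role', 'Navigation')]` both match, and `[('class', 'infobox')]` does not. The substring rule also matches `[('class', 'website-navigation')]`, since that value contains `site-nav`; whether such over-matching matters depends on the corpus.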
@@ -39,6 +58,8 @@ def decode_text(raw: bytes) -> bytes:
     ss_data_len = ss_len

     out = []
+    depth = 0
+    skip_above = None  # while set, drop text until depth returns here

     def boundary(tag):
         if tag in _BLOCK_TAGS:
@@ -54,19 +75,27 @@ def decode_text(raw: bytes) -> bytes:
             tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
                 else ID_TO_TAG.get(tag_id, '')
             ac = ss.read_byte()
+            attrs = []
             for _ in range(ac):
                 aid = ss.read_byte()
-                if aid == UNKNOWN_ID
-
-                cs.read_string(SHARED_STRINGS)
-
+                aname = cs.read_string(SHARED_STRINGS) if aid == UNKNOWN_ID \
+                    else ID_TO_ATTR.get(aid, '')
+                attrs.append((aname, cs.read_string(SHARED_STRINGS)))
+            if nt == T_OPEN:
+                depth += 1
+                if skip_above is None and _is_boilerplate(attrs):
+                    skip_above = depth - 1
+            if skip_above is None:
+                boundary(tag)

         elif nt == T_CLOSE:
-
+            depth = max(0, depth - 1)
+            if skip_above is not None and depth <= skip_above:
+                skip_above = None

         elif nt == T_TEXT:
             t = cs.read_string(SHARED_STRINGS)
-            if t:
+            if t and skip_above is None:
                 out.append(t)

         elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
@@ -74,5 +103,5 @@ def decode_text(raw: bytes) -> bytes:

     text = ''.join(out)
     text = _collapse_space.sub(' ', text)
-    text = _collapse_blank.sub('\n', text)
+    text = _collapse_blank.sub('\n\n', text)  # keep paragraph boundaries
     return text.strip().encode('utf-8')
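Note on the `decode_text` changes above: text is dropped while inside a boilerplate subtree, tracked with a depth counter and a `skip_above` marker, and blank-line runs now collapse to one empty line instead of a single newline. The sketch below replays that logic on a simplified token list. The tuple-based tokens, the `tokens_to_text` and `is_navbox` names, the reduced block-tag set, and the choice to emit a newline on both open and close of block tags are all assumptions; the real decoder reads a binary stream and its `boundary()` body is not shown in this diff.

```python
import re

_collapse_blank = re.compile(r'\n\s*\n+')
_collapse_space = re.compile(r'[ \t\f\v]+')
BLOCK_TAGS = frozenset(('p', 'div', 'li'))  # reduced set, for illustration only


def is_navbox(attrs):
    """Toy predicate: treat any class containing 'navbox' as boilerplate."""
    return any(n == 'class' and 'navbox' in v.lower() for n, v in attrs)


def tokens_to_text(tokens, is_boilerplate=is_navbox):
    out = []
    depth = 0
    skip_above = None  # while set, drop text until depth returns to this level
    for tok in tokens:
        kind = tok[0]
        if kind == 'open':
            _, tag, attrs = tok
            depth += 1
            if skip_above is None and is_boilerplate(attrs):
                skip_above = depth - 1
            if skip_above is None and tag in BLOCK_TAGS:
                out.append('\n')
        elif kind == 'close':
            depth = max(0, depth - 1)
            if skip_above is not None and depth <= skip_above:
                skip_above = None  # left the boilerplate subtree
            if skip_above is None and tok[1] in BLOCK_TAGS:
                out.append('\n')
        elif kind == 'text':
            if tok[1] and skip_above is None:
                out.append(tok[1])
    text = ''.join(out)
    text = _collapse_space.sub(' ', text)
    text = _collapse_blank.sub('\n\n', text)  # keep paragraph boundaries
    return text.strip()


TOKENS = [
    ('open', 'div', []),
    ('open', 'p', []), ('text', 'Black   holes bend light.'), ('close', 'p'),
    ('open', 'div', [('class', 'navbox')]),
    ('open', 'p', []), ('text', 'Navigation links'), ('close', 'p'),
    ('close', 'div'),
    ('open', 'p', []), ('text', 'They are dense.'), ('close', 'p'),
    ('close', 'div'),
]
```

Running `tokens_to_text(TOKENS)` keeps the two content paragraphs separated by one empty line and drops the text nested under the `navbox` element, including text in its child elements.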
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.
+Version: 0.3.1
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -98,23 +98,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary

 ## Hosted corpora — free

-
-
-
-
-```
-https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```bash
+storetle corpora # list what's available
+storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
+storetle get wiki-text "Black hole" # from the clean-text edition
 ```

-
+Corpus names resolve through a public registry
+(`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
+package update. Title lookup fetches a small index once and caches it.

-
-
-
-
+**Available now — Simple English Wikipedia, complete** (267,503 articles,
+snapshot 2025-03-20, CC-BY-SA-4.0):
+
+| edition | size | contents |
+|---|---|---|
+| `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
+| `wiki-text` | 196 MB / 1 file | clean plain text, random access |
+| `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |

-
-
+All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
+indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
+in 196 MB, where any article is one ~2 MB range request away — that's the
+point of the format. More corpora (arXiv, PubMed Central OA) coming.

 ## Plain text extraction (v0.2.2)

@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
 Pages, nginx):

 ```bash
-storetle info https://
-storetle get
+storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
+storetle get wiki "Albert Einstein" --text
 ```

 ```python
File without changes