storetle 0.2.0__tar.gz → 0.2.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {storetle-0.2.0/storetle.egg-info → storetle-0.2.2}/PKG-INFO +48 -1
- {storetle-0.2.0 → storetle-0.2.2}/README.md +47 -0
- {storetle-0.2.0 → storetle-0.2.2}/pyproject.toml +1 -1
- {storetle-0.2.0 → storetle-0.2.2}/storetle/__init__.py +3 -2
- {storetle-0.2.0 → storetle-0.2.2}/storetle/cli.py +42 -16
- storetle-0.2.2/storetle/remote.py +190 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/stream.py +13 -0
- storetle-0.2.2/storetle/text.py +78 -0
- {storetle-0.2.0 → storetle-0.2.2/storetle.egg-info}/PKG-INFO +48 -1
- {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/SOURCES.txt +2 -0
- {storetle-0.2.0 → storetle-0.2.2}/LICENSE +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/setup.cfg +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/brotli_compat.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/cube_dict_v10.bin +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/decoder.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/encoder.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/folder.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/vocab.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/warc.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle/zstd_compat.py +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/dependency_links.txt +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/entry_points.txt +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/requires.txt +0 -0
- {storetle-0.2.0 → storetle-0.2.2}/storetle.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: storetle
-Version: 0.2.
+Version: 0.2.2
 Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
 Author-email: Davis Brief <davis@team8.co>
 License: MIT
@@ -96,6 +96,53 @@ storetle to-warc archive.storetle out.warc.gz
 storetle train my_corpus/ --output my.bin # domain-specific dictionary
 ```
 
+## Hosted corpora — free
+
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
+```
+
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+
+## Plain text extraction (v0.2.2)
+
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
+
+## Remote archives (v0.2.1)
+
+`get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+of Range requests; fetching a document downloads only its ~2MB chunk — no
+server-side code, works against any Range-capable host (R2, S3, GitHub
+Pages, nginx):
+
+```bash
+storetle info https://adventurelands.github.io/storetle/sample.storetle
+storetle get https://adventurelands.github.io/storetle/sample.storetle 4
+```
+
+```python
+from storetle import RemoteReader
+with RemoteReader('https://host/corpus.storetle') as r:
+    html = r[42] # one ~2MB range request
+```
+
 ## Python API
 
 ```python
|
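The README text in this hunk mentions a SHA-256 manifest for the hosted shards but does not show its layout. As a minimal sketch, assuming the manifest can be reduced to a mapping from shard file name to hex digest (that mapping, and the names `sha256_of` and `verify`, are illustrative and not part of storetle), a downloaded shard could be checked like this:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Hash a file in fixed-size blocks so a large shard never sits in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()


def verify(path, manifest):
    """Compare a file's digest with the manifest entry for its file name.

    `manifest` is a hypothetical {file name: hex digest} dict; the real
    manifest.json may be laid out differently.
    """
    expected = manifest.get(Path(path).name)
    return expected is not None and expected == sha256_of(path)


# Self-check against a throwaway file standing in for a shard.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / 'example-0000.storetle'
    shard.write_bytes(b'not a real archive')
    good = {shard.name: hashlib.sha256(b'not a real archive').hexdigest()}
    bad = {shard.name: '0' * 64}
    ok_result = verify(shard, good)
    bad_result = verify(shard, bad)
```

Hashing in blocks matters here because the shards are described as 100+ MB each.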
@@ -73,6 +73,53 @@ storetle to-warc archive.storetle out.warc.gz
 storetle train my_corpus/ --output my.bin # domain-specific dictionary
 ```
 
+## Hosted corpora — free
+
+**Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
+in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
+JSONL metadata indexes (title ↔ doc index) and a SHA-256 manifest:
+
+```
+https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
+```
+
+Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
+
+```bash
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
+storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
+```
+
+Find a title's index by grepping the shard's `.index.jsonl`. More corpora
+(arXiv, PubMed Central OA) coming.
+
+## Plain text extraction (v0.2.2)
+
+`--text` on `get`/`unpack` (and `get_text()`/`iter_text()` in the API)
+extracts tag-stripped clean text **without re-parsing HTML** — the encoding
+already separates structure from content, so text extraction is a walk over
+the structure opcodes that keeps text nodes, drops script/style bodies, and
+emits newlines at block boundaries. A 383 KB Wikipedia article becomes 39 KB
+of readable text.
+
+## Remote archives (v0.2.1)
+
+`get`, `info`, and `unpack` accept URLs. Opening an archive costs a few KB
+of Range requests; fetching a document downloads only its ~2MB chunk — no
+server-side code, works against any Range-capable host (R2, S3, GitHub
+Pages, nginx):
+
+```bash
+storetle info https://adventurelands.github.io/storetle/sample.storetle
+storetle get https://adventurelands.github.io/storetle/sample.storetle 4
+```
+
+```python
+from storetle import RemoteReader
+with RemoteReader('https://host/corpus.storetle') as r:
+    html = r[42] # one ~2MB range request
+```
+
 ## Python API
 
 ```python
|
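The README suggests grepping a shard's `.index.jsonl` to find a title's document index. The same lookup can be sketched in Python. The record layout is not shown in this diff, so the field names `title` and `index` below are placeholders, and `find_title` is an illustrative helper rather than a storetle function:

```python
import io
import json


def find_title(lines, title, title_key='title', index_key='index'):
    """Return the doc index recorded for `title`, or None if it is absent.

    `lines` is any iterable of JSONL text lines. The key names are
    placeholders: check the real index file for the actual field names.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(title_key) == title:
            return record.get(index_key)
    return None


# Made-up sample records, only to exercise the function.
sample = io.StringIO(
    '{"title": "Alpha", "index": 0}\n'
    '{"title": "Beta", "index": 1}\n'
)
beta_index = find_title(sample, 'Beta')
```

The returned number would then be the `<index>` argument to `storetle get`, provided the index file records positions within the same shard.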
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "storetle"
-version = "0.2.
+version = "0.2.2"
 description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -8,10 +8,11 @@
 # benchmark — compare storetle vs gzip on your own data
 
 from .stream import StreamWriter, StreamReader
+from .remote import RemoteReader
 from .folder import pack, unpack
 
-__version__ = '0.2.
-__all__ = ['StreamWriter', 'StreamReader', 'pack', 'unpack', 'benchmark']
+__version__ = '0.2.2'
+__all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']
 
 
 def benchmark(folder, quiet=False):
@@ -54,19 +54,35 @@ def cmd_pack(args):
     print(f'Output: {output}')
 
 
+def _is_url(s):
+    return s.startswith('http://') or s.startswith('https://')
+
+
+def _open_reader(src):
+    """Open a local path with StreamReader or a URL with RemoteReader."""
+    if _is_url(src):
+        from .remote import RemoteReader
+        return RemoteReader(src)
+    from .stream import StreamReader
+    return StreamReader(src)
+
+
 def cmd_unpack(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle unpack <file
+        print('Usage: storetle unpack <file-or-url> <output_folder> [--text]')
         sys.exit(1)
-    from .stream import StreamReader
     src = args[0]
     dst = Path(args[1])
     dst.mkdir(parents=True, exist_ok=True)
 
-
-
-
-
+    ext = 'txt' if text else 'html'
+    with _open_reader(src) as r:
+        print(f'Extracting {r.doc_count} documents to {dst}/ as .{ext}')
+        docs = r.iter_text() if text else iter(r)
+        for i, doc in enumerate(docs):
+            out = dst / f'doc_{i:06d}.{ext}'
             out.write_bytes(doc)
             if (i + 1) % 100 == 0:
                 print(f' {i+1}/{r.doc_count}...')
|
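The hunk above adds two small patterns to the CLI: `--text` is pulled out of the argument list wherever it appears, and the source string decides which reader class is opened. A standalone sketch of both follows. `split_flags` and `reader_kind` are illustrative names that do not exist in storetle, and the sketch returns class names as strings so it runs without the package installed:

```python
def split_flags(args, flag='--text'):
    """Return (flag_present, remaining_args), wherever the flag appears."""
    present = flag in args
    rest = [a for a in args if a != flag]
    return present, rest


def reader_kind(src):
    """Name the reader class the CLI would pick for `src`."""
    is_url = src.startswith(('http://', 'https://'))
    return 'RemoteReader' if is_url else 'StreamReader'


text, positional = split_flags(['archive.storetle', '--text', '4'])
```

Because the flag is filtered out before the positional arguments are counted, `storetle get --text <file> <index>` and `storetle get <file> <index> --text` behave the same way.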
@@ -75,15 +91,20 @@ def cmd_unpack(args):
 
 def cmd_info(args):
     if not args:
-        print('Usage: storetle info <file
+        print('Usage: storetle info <file-or-url>')
         sys.exit(1)
-    from .stream import StreamReader
 
     def fmt(n):
         if n < 1048576: return f'{n/1024:.1f}KB'
         return f'{n/1048576:.2f}MB'
 
-
+    if _is_url(args[0]):
+        from .remote import RemoteReader
+        with RemoteReader(args[0]) as r:
+            info = r.info()
+    else:
+        from .stream import StreamReader
+        info = StreamReader.info(args[0])
     print(f'\n {args[0]}')
     print(f' Documents: {info["docs"]:,}')
     print(f' Chunks: {info["chunks"]:,}')
|
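For reference, here is a small worked example of the two calculations behind the `info` output: the `fmt` helper shown in this hunk, which switches units at 1,048,576 bytes, and the `ratio_pct` expression that `RemoteReader.info()` uses in the remote.py hunk of this diff. The byte counts are made up for illustration:

```python
def fmt(n):
    """Format a byte count the way cmd_info does: KB below 1 MiB, MB above."""
    if n < 1048576:
        return f'{n/1024:.1f}KB'
    return f'{n/1048576:.2f}MB'


def ratio_pct(compressed, original):
    """Percentage of bytes saved, guarded against a zero original size."""
    return round(100 * (1 - compressed / max(1, original)), 2)


small = fmt(2048)          # '2.0KB'
large = fmt(2621440)       # '2.50MB'
saved = ratio_pct(250, 1000)   # 75.0
```

Note that `ratio_pct` reports savings, so a higher number means a smaller archive.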
@@ -93,15 +114,18 @@ def cmd_info(args):
 
 
 def cmd_get(args):
+    text = '--text' in args
+    args = [a for a in args if a != '--text']
     if len(args) < 2:
-        print('Usage: storetle get <file
+        print('Usage: storetle get <file-or-url> <index> [--text]')
         sys.exit(1)
-
-    with StreamReader(args[0]) as r:
+    with _open_reader(args[0]) as r:
         try:
             idx = int(args[1])
-            doc = r[idx]
+            doc = r.get_text(idx) if text else r[idx]
             sys.stdout.buffer.write(doc)
+            if text:
+                sys.stdout.buffer.write(b'\n')
         except (IndexError, ValueError) as e:
             print(f'Error: {e}')
             sys.exit(1)
@@ -254,9 +278,11 @@ HELP = """storetle — HTML-aware compression for large document collections
 Commands:
 bench <folder> Benchmark your HTML data vs gzip WARC
 pack <folder> <output> Compress a folder → .storetle file
-unpack <file> <
-info <file>
-get <file> <index>
+unpack <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
+info <file-or-url> Show file statistics
+get <file-or-url> <index> Extract one doc by index — over HTTP this
+fetches only the containing ~2MB chunk.
+Add --text for tag-stripped plain text
 from-warc <input.warc[.gz]> <out> Convert WARC → .storetle
 to-warc <input.storetle> <out> Convert .storetle → WARC (or .warc.gz)
 warc-encode <input.warc[.gz]> <out> Encode HTML in-place → valid .warc.gz (smaller, standard format)
@@ -0,0 +1,190 @@
+# remote.py — read .storetle archives over HTTP(S) with Range requests.
+#
+# Opens an archive with at most three small requests (footer+index tail,
+# header, dictionary if embedded), then fetches exactly one chunk span
+# (≤ ~2 MB compressed) per document access. Works against any server or
+# object store that honors Range (S3, R2, GitHub Pages, nginx, ...).
+#
+# Stdlib only — urllib, struct.
+
+import struct
+import urllib.request
+from pathlib import Path
+
+from .stream import STREAM_MAGIC, STREAM_VERSION, _decompress, _decode_doc
+
+_DEFAULT_DICT_PATH = Path(__file__).parent / 'cube_dict_v10.bin'
+
+# One speculative tail fetch usually captures index + footer in a single
+# round trip (index entries are 14 bytes; 64 KB covers ~4,600 chunks ≈
+# 1.2M documents).
+_TAIL_BYTES = 64 * 1024
+
+
+class RemoteReader:
+    """Random-access reader for a .storetle file served over HTTP(S).
+
+    Usage:
+        with RemoteReader('https://host/corpus.storetle') as r:
+            print(r.doc_count)
+            html = r[42]
+            for doc in r:
+                ...
+    """
+
+    def __init__(self, url, dictionary=None, timeout=30):
+        self._url = url
+        self._timeout = timeout
+        self._chunk_cache = (None, None) # (chunk_idx, [decoded raw docs])
+        self.bytes_fetched = 0
+
+        tail = self._fetch_suffix(_TAIL_BYTES)
+        if len(tail) < 16:
+            raise ValueError('File too small to be a .storetle archive')
+        chunk_count, index_offset = struct.unpack('>QQ', tail[-16:])
+
+        index_size = chunk_count * 14
+        if index_size + 16 <= len(tail):
+            index_raw = tail[-(index_size + 16):-16]
+        else:
+            index_raw = self._fetch(index_offset, index_offset + index_size - 1)
+
+        self._index = []
+        for i in range(chunk_count):
+            off, dc, orig = struct.unpack_from('>QHI', index_raw, i * 14)
+            self._index.append((off, dc, orig))
+
+        # chunk i occupies [offset_i, offset_{i+1}); the last ends at the index
+        self._chunk_ends = [self._index[i + 1][0] for i in range(chunk_count - 1)]
+        self._chunk_ends.append(index_offset)
+
+        # cumulative doc counts for index lookup
+        self._cum = [0]
+        for _, dc, _ in self._index:
+            self._cum.append(self._cum[-1] + dc)
+        self.doc_count = self._cum[-1]
+
+        head = self._fetch(0, 8)
+        if head[:4] != STREAM_MAGIC:
+            raise ValueError('Not a .storetle file (magic: %r)' % head[:4])
+        if head[4] != STREAM_VERSION:
+            raise ValueError('Unsupported version %d (reader is v%d)'
+                             % (head[4], STREAM_VERSION))
+        dict_size = struct.unpack('>I', head[5:9])[0]
+
+        if dictionary is not None:
+            self._dict = dictionary
+        elif dict_size:
+            self._dict = self._fetch(9, 9 + dict_size - 1)
+        else:
+            self._dict = _DEFAULT_DICT_PATH.read_bytes() \
+                if _DEFAULT_DICT_PATH.exists() else b''
+
+    # -- HTTP plumbing ------------------------------------------------------
+
+    def _fetch(self, start, end):
+        return self._range_request('bytes=%d-%d' % (start, end))
+
+    def _fetch_suffix(self, n):
+        return self._range_request('bytes=-%d' % n)
+
+    def _range_request(self, range_header):
+        req = urllib.request.Request(self._url, headers={
+            'Range': range_header,
+            'User-Agent': 'storetle-remote/0.2.1',
+        })
+        with urllib.request.urlopen(req, timeout=self._timeout) as resp:
+            if resp.status not in (200, 206):
+                raise IOError('HTTP %d for %s' % (resp.status, self._url))
+            if resp.status == 200 and range_header != 'bytes=0-':
+                raise IOError(
+                    'Server ignored Range request — remote access needs a '
+                    'server that supports HTTP Range (got full response)')
+            data = resp.read()
+        self.bytes_fetched += len(data)
+        return data
+
+    # -- document access ----------------------------------------------------
+
+    def _load_chunk(self, ci):
+        if self._chunk_cache[0] == ci:
+            return self._chunk_cache[1]
+        off, expect_dc, _ = self._index[ci]
+        raw = self._fetch(off, self._chunk_ends[ci] - 1)
+        dc, _orig, comp_size = struct.unpack_from('>HII', raw, 0)
+        if dc != expect_dc:
+            raise ValueError('Chunk %d header disagrees with index' % ci)
+        sizes = struct.unpack_from('>%dI' % dc, raw, 10)
+        blob = _decompress(raw[10 + dc * 4: 10 + dc * 4 + comp_size], self._dict)
+        docs, pos = [], 0
+        for s in sizes:
+            docs.append(blob[pos:pos + s])
+            pos += s
+        self._chunk_cache = (ci, docs)
+        return docs
+
+    def _locate(self, idx):
+        lo, hi = 0, len(self._index) - 1
+        while lo < hi:
+            mid = (lo + hi) // 2
+            if self._cum[mid + 1] <= idx:
+                lo = mid + 1
+            else:
+                hi = mid
+        return lo
+
+    def __len__(self):
+        return self.doc_count
+
+    def __getitem__(self, idx):
+        if isinstance(idx, slice):
+            return [self[i] for i in range(*idx.indices(self.doc_count))]
+        if idx < 0:
+            idx += self.doc_count
+        if not 0 <= idx < self.doc_count:
+            raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+        ci = self._locate(idx)
+        docs = self._load_chunk(ci)
+        return _decode_doc(docs[idx - self._cum[ci]])
+
+    def __iter__(self):
+        for ci in range(len(self._index)):
+            for raw in self._load_chunk(ci):
+                yield _decode_doc(raw)
+
+    def get_text(self, idx):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        if idx < 0:
+            idx += self.doc_count
+        if not 0 <= idx < self.doc_count:
+            raise IndexError('doc %d out of range (%d docs)' % (idx, self.doc_count))
+        ci = self._locate(idx)
+        return decode_text(self._load_chunk(ci)[idx - self._cum[ci]])
+
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._load_chunk(ci):
+                yield decode_text(raw)
+
+    def info(self):
+        comp = self._chunk_ends[-1] - self._index[0][0] if self._index else 0
+        return {
+            'docs': self.doc_count,
+            'chunks': len(self._index),
+            'original_bytes': sum(orig for _, _, orig in self._index),
+            'compressed_bytes': comp,
+            'ratio_pct': round(100 * (1 - comp / max(1, sum(o for _, _, o in self._index))), 2),
+        }
+
+    def close(self):
+        self._chunk_cache = (None, None)
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, *exc):
+        self.close()
+        return False
|
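The open-then-fetch flow in remote.py can be exercised without a network by building a synthetic archive tail. The sketch below reuses the struct formats from the hunk (`>QQ` footer, 14-byte `>QHI` index entries) and shows how a document index turns into a single `Range` header. `parse_tail` and `plan_fetch` are illustrative names, the offsets are made up, and `bisect_right` is swapped in for the hand-written binary search in `_locate`:

```python
import struct
from bisect import bisect_right

FOOTER = struct.Struct('>QQ')   # chunk_count, index_offset
ENTRY = struct.Struct('>QHI')   # chunk offset, doc count, original size


def parse_tail(tail):
    """Split the trailing bytes of an archive into index entries and footer."""
    chunk_count, index_offset = FOOTER.unpack(tail[-FOOTER.size:])
    index_size = chunk_count * ENTRY.size
    if index_size + FOOTER.size > len(tail):
        raise ValueError('tail too short; the index needs its own request')
    index_raw = tail[-(index_size + FOOTER.size):-FOOTER.size]
    entries = [ENTRY.unpack_from(index_raw, i * ENTRY.size)
               for i in range(chunk_count)]
    return entries, index_offset


def plan_fetch(entries, index_offset, doc_idx):
    """Return (chunk index, position inside chunk, Range header value)."""
    cum = [0]
    for _, doc_count, _ in entries:
        cum.append(cum[-1] + doc_count)
    if not 0 <= doc_idx < cum[-1]:
        raise IndexError(doc_idx)
    ci = bisect_right(cum, doc_idx) - 1
    start = entries[ci][0]
    # A chunk ends where the next one starts; the last ends at the index.
    end = entries[ci + 1][0] if ci + 1 < len(entries) else index_offset
    return ci, doc_idx - cum[ci], 'bytes=%d-%d' % (start, end - 1)


# Synthetic archive: two chunks (3 docs, then 2 docs), index at byte 909.
index = ENTRY.pack(9, 3, 1000) + ENTRY.pack(509, 2, 800)
tail = b'chunk bytes...' + index + FOOTER.pack(2, 909)
entries, index_offset = parse_tail(tail)
plan = plan_fetch(entries, index_offset, 4)
```

Slicing from the end of `tail` is what lets one speculative suffix request serve both the footer and the index, regardless of how many chunk bytes precede them in the fetched window.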
@@ -426,6 +426,19 @@ class StreamReader:
         raw = self._read_chunk(ci)[wi]
         return _decode_doc(raw)
 
+    def get_text(self, doc_idx: int):
+        """Return extracted plain text (no tags) for a single document."""
+        from .text import decode_text
+        ci, wi = self._locate(doc_idx)
+        return decode_text(self._read_chunk(ci)[wi])
+
+    def iter_text(self):
+        """Yield extracted plain text for every document, in order."""
+        from .text import decode_text
+        for ci in range(len(self._index)):
+            for raw in self._read_chunk(ci):
+                yield decode_text(raw)
+
     def __getitem__(self, key):
         if isinstance(key, int):
             return self.get(key)
|
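Both readers pick one encoded document out of a decoded chunk (`_read_chunk(ci)[wi]` here, `_load_chunk` in remote.py). The layout that `_load_chunk` parses is a 10-byte `>HII` header, one 4-byte size per document, then a compressed blob. It can be sketched as a round trip. `zlib` is only a stand-in codec here, since storetle's `_decompress` takes a dictionary and its codec is not shown in this diff. `build_chunk` and `split_chunk` are illustrative names, and storing the blob length in the second header field is an assumption:

```python
import struct
import zlib

HEADER = struct.Struct('>HII')  # doc count, original size, compressed size


def build_chunk(docs):
    """Pack encoded docs into the layout the reader expects (zlib stand-in)."""
    blob = b''.join(docs)
    comp = zlib.compress(blob)
    sizes = struct.pack('>%dI' % len(docs), *(len(d) for d in docs))
    # Assumption: the second field holds the uncompressed blob length.
    return HEADER.pack(len(docs), len(blob), len(comp)) + sizes + comp


def split_chunk(raw):
    """Inverse of build_chunk: return the list of per-document byte strings."""
    dc, _orig, comp_size = HEADER.unpack_from(raw, 0)
    sizes = struct.unpack_from('>%dI' % dc, raw, HEADER.size)
    start = HEADER.size + dc * 4
    blob = zlib.decompress(raw[start:start + comp_size])
    docs, pos = [], 0
    for s in sizes:
        docs.append(blob[pos:pos + s])
        pos += s
    return docs


docs = [b'first doc', b'', b'third, longer document']
round_trip = split_chunk(build_chunk(docs))
```

Because every document in a chunk shares one compressed blob, reading any single document means decompressing its whole chunk, which is why both readers cache the most recently decoded chunk.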
@@ -0,0 +1,78 @@
+# text.py — plain-text extraction straight from the NodeOp encoding.
+#
+# The encoded form already separates structure (struct stream) from content
+# (content stream), so producing clean text never re-parses HTML: walk the
+# opcodes, keep T_TEXT payloads, skip script/style bodies (T_RAWTEXT),
+# comments and doctypes, and emit newlines at block-element boundaries.
+#
+# This consumes the content stream in exact lockstep with stream._decode_doc —
+# every string the HTML decoder would read, this reads too, it just throws
+# most of them away.
+
+import re
+import struct
+
+from .decoder import _Stream
+from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
+                      T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
+from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
+
+_BLOCK_TAGS = frozenset((
+    'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
+    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+    'table', 'tr', 'caption', 'thead', 'tbody',
+    'section', 'article', 'aside', 'header', 'footer', 'main', 'nav',
+    'blockquote', 'pre', 'figure', 'figcaption', 'hr', 'title',
+))
+_CELL_TAGS = frozenset(('td', 'th'))
+
+_collapse_blank = re.compile(r'\n\s*\n+')
+_collapse_space = re.compile(r'[ \t\f\v]+')
+
+
+def decode_text(raw: bytes) -> bytes:
+    """Extract plain text from one encoded document (the blob stored in a
+    chunk), without reconstructing HTML."""
+    ss_len = struct.unpack_from('>I', raw, 0)[0]
+    ss = _Stream(raw[4: 4 + ss_len])
+    cs = _Stream(raw[4 + ss_len:])
+    ss_data_len = ss_len
+
+    out = []
+
+    def boundary(tag):
+        if tag in _BLOCK_TAGS:
+            out.append('\n')
+        elif tag in _CELL_TAGS:
+            out.append('\t')
+
+    while ss._pos < ss_data_len:
+        nt = ss.read_byte()
+
+        if nt in (T_OPEN, T_SELFCLOSE):
+            tag_id = ss.read_byte()
+            tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
+                else ID_TO_TAG.get(tag_id, '')
+            ac = ss.read_byte()
+            for _ in range(ac):
+                aid = ss.read_byte()
+                if aid == UNKNOWN_ID:
+                    cs.read_string(SHARED_STRINGS) # attr name — discard
+                cs.read_string(SHARED_STRINGS) # attr value — discard
+            boundary(tag)
+
+        elif nt == T_CLOSE:
+            pass # no payload; block boundary handled at open
+
+        elif nt == T_TEXT:
+            t = cs.read_string(SHARED_STRINGS)
+            if t:
+                out.append(t)
+
+        elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
+            cs.read_string(SHARED_STRINGS) # script/style/meta — discard
+
+    text = ''.join(out)
+    text = _collapse_space.sub(' ', text)
+    text = _collapse_blank.sub('\n', text)
+    return text.strip().encode('utf-8')
|
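The walk in text.py can be illustrated without storetle's byte streams by replacing the opcode stream with plain `(kind, value)` pairs. This is a simplified sketch: the kind strings, the reduced tag sets, and `extract_text` are stand-ins for illustration, while the two regular expressions are copied from the hunk above:

```python
import re

BLOCK_TAGS = frozenset(('p', 'div', 'h1', 'li', 'br'))
CELL_TAGS = frozenset(('td', 'th'))
COLLAPSE_SPACE = re.compile(r'[ \t\f\v]+')
COLLAPSE_BLANK = re.compile(r'\n\s*\n+')


def extract_text(ops):
    """Walk (kind, value) pairs; kinds are hypothetical stand-ins for opcodes."""
    out = []
    for kind, value in ops:
        if kind == 'open':
            if value in BLOCK_TAGS:
                out.append('\n')
            elif value in CELL_TAGS:
                out.append('\t')
        elif kind == 'text':
            out.append(value)
        # 'close', 'rawtext', 'comment' and 'doctype' contribute nothing.
    text = COLLAPSE_SPACE.sub(' ', ''.join(out))
    text = COLLAPSE_BLANK.sub('\n', text)
    return text.strip()


ops = [
    ('doctype', 'html'),
    ('open', 'h1'), ('text', 'Title'), ('close', 'h1'),
    ('open', 'script'), ('rawtext', 'var x = 1;'), ('close', 'script'),
    ('open', 'div'), ('open', 'p'), ('text', 'Hello   '),
    ('open', 'b'), ('text', 'world'), ('close', 'b'), ('close', 'p'),
    ('open', 'td'), ('text', 'a'), ('close', 'td'),
    ('open', 'td'), ('text', 'b'), ('close', 'td'),
]
result = extract_text(ops)   # 'Title\nHello world a b'
```

One consequence of the pattern order is worth noting: the tab emitted for `td`/`th` is folded into a single space, because `\t` is in the space-collapsing character class, so table cells end up separated by spaces in the final text.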
File without changes (×14: every file listed above with +0 -0)