storetle 0.2.2__tar.gz → 0.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (25)
  1. {storetle-0.2.2/storetle.egg-info → storetle-0.3.1}/PKG-INFO +22 -16
  2. {storetle-0.2.2 → storetle-0.3.1}/README.md +21 -15
  3. {storetle-0.2.2 → storetle-0.3.1}/pyproject.toml +1 -1
  4. {storetle-0.2.2 → storetle-0.3.1}/storetle/__init__.py +1 -1
  5. {storetle-0.2.2 → storetle-0.3.1}/storetle/cli.py +30 -4
  6. storetle-0.3.1/storetle/registry.py +104 -0
  7. {storetle-0.2.2 → storetle-0.3.1}/storetle/stream.py +29 -0
  8. {storetle-0.2.2 → storetle-0.3.1}/storetle/text.py +37 -8
  9. {storetle-0.2.2 → storetle-0.3.1/storetle.egg-info}/PKG-INFO +22 -16
  10. {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/SOURCES.txt +1 -0
  11. {storetle-0.2.2 → storetle-0.3.1}/LICENSE +0 -0
  12. {storetle-0.2.2 → storetle-0.3.1}/setup.cfg +0 -0
  13. {storetle-0.2.2 → storetle-0.3.1}/storetle/brotli_compat.py +0 -0
  14. {storetle-0.2.2 → storetle-0.3.1}/storetle/cube_dict_v10.bin +0 -0
  15. {storetle-0.2.2 → storetle-0.3.1}/storetle/decoder.py +0 -0
  16. {storetle-0.2.2 → storetle-0.3.1}/storetle/encoder.py +0 -0
  17. {storetle-0.2.2 → storetle-0.3.1}/storetle/folder.py +0 -0
  18. {storetle-0.2.2 → storetle-0.3.1}/storetle/remote.py +0 -0
  19. {storetle-0.2.2 → storetle-0.3.1}/storetle/vocab.py +0 -0
  20. {storetle-0.2.2 → storetle-0.3.1}/storetle/warc.py +0 -0
  21. {storetle-0.2.2 → storetle-0.3.1}/storetle/zstd_compat.py +0 -0
  22. {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/dependency_links.txt +0 -0
  23. {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/entry_points.txt +0 -0
  24. {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/requires.txt +0 -0
  25. {storetle-0.2.2 → storetle-0.3.1}/storetle.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: storetle
3
- Version: 0.2.2
3
+ Version: 0.3.1
4
4
  Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
5
5
  Author-email: Davis Brief <davis@team8.co>
6
6
  License: MIT
@@ -98,23 +98,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary
98
98
 
99
99
  ## Hosted corpora — free
100
100
 
101
- **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
102
- in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
103
- JSONL metadata indexes (title doc index) and a SHA-256 manifest:
104
-
105
- ```
106
- https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
101
+ ```bash
102
+ storetle corpora # list what's available
103
+ storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
104
+ storetle get wiki-text "Black hole" # from the clean-text edition
107
105
  ```
108
106
 
109
- Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
107
+ Corpus names resolve through a public registry
108
+ (`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
109
+ package update. Title lookup fetches a small index once and caches it.
110
110
 
111
- ```bash
112
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
113
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
114
- ```
111
+ **Available now — Simple English Wikipedia, complete** (267,503 articles,
112
+ snapshot 2025-03-20, CC-BY-SA-4.0):
113
+
114
+ | edition | size | contents |
115
+ |---|---|---|
116
+ | `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
117
+ | `wiki-text` | 196 MB / 1 file | clean plain text, random access |
118
+ | `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
115
119
 
116
- Find a title's index by grepping the shard's `.index.jsonl`. More corpora
117
- (arXiv, PubMed Central OA) coming.
120
+ All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
121
+ indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
122
+ in 196 MB, where any article is one ~2 MB range request away — that's the
123
+ point of the format. More corpora (arXiv, PubMed Central OA) coming.
118
124
 
119
125
  ## Plain text extraction (v0.2.2)
120
126
 
@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
133
139
  Pages, nginx):
134
140
 
135
141
  ```bash
136
- storetle info https://adventurelands.github.io/storetle/sample.storetle
137
- storetle get https://adventurelands.github.io/storetle/sample.storetle 4
142
+ storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
143
+ storetle get wiki "Albert Einstein" --text
138
144
  ```
139
145
 
140
146
  ```python
@@ -75,23 +75,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary
75
75
 
76
76
  ## Hosted corpora — free
77
77
 
78
- **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
79
- in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
80
- JSONL metadata indexes (title doc index) and a SHA-256 manifest:
81
-
82
- ```
83
- https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
78
+ ```bash
79
+ storetle corpora # list what's available
80
+ storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
81
+ storetle get wiki-text "Black hole" # from the clean-text edition
84
82
  ```
85
83
 
86
- Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
84
+ Corpus names resolve through a public registry
85
+ (`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
86
+ package update. Title lookup fetches a small index once and caches it.
87
87
 
88
- ```bash
89
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
90
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
91
- ```
88
+ **Available now — Simple English Wikipedia, complete** (267,503 articles,
89
+ snapshot 2025-03-20, CC-BY-SA-4.0):
90
+
91
+ | edition | size | contents |
92
+ |---|---|---|
93
+ | `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
94
+ | `wiki-text` | 196 MB / 1 file | clean plain text, random access |
95
+ | `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
92
96
 
93
- Find a title's index by grepping the shard's `.index.jsonl`. More corpora
94
- (arXiv, PubMed Central OA) coming.
97
+ All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
98
+ indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
99
+ in 196 MB, where any article is one ~2 MB range request away — that's the
100
+ point of the format. More corpora (arXiv, PubMed Central OA) coming.
95
101
 
96
102
  ## Plain text extraction (v0.2.2)
97
103
 
@@ -110,8 +116,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
110
116
  Pages, nginx):
111
117
 
112
118
  ```bash
113
- storetle info https://adventurelands.github.io/storetle/sample.storetle
114
- storetle get https://adventurelands.github.io/storetle/sample.storetle 4
119
+ storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
120
+ storetle get wiki "Albert Einstein" --text
115
121
  ```
116
122
 
117
123
  ```python
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "storetle"
7
- version = "0.2.2"
7
+ version = "0.3.1"
8
8
  description = "HTML-aware compression for document corpora — solid-archive ratios with random access"
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.8"
@@ -11,7 +11,7 @@ from .stream import StreamWriter, StreamReader
11
11
  from .remote import RemoteReader
12
12
  from .folder import pack, unpack
13
13
 
14
- __version__ = '0.2.2'
14
+ __version__ = '0.3.1'
15
15
  __all__ = ['StreamWriter', 'StreamReader', 'RemoteReader', 'pack', 'unpack', 'benchmark']
16
16
 
17
17
 
@@ -117,11 +117,23 @@ def cmd_get(args):
117
117
  text = '--text' in args
118
118
  args = [a for a in args if a != '--text']
119
119
  if len(args) < 2:
120
- print('Usage: storetle get <file-or-url> <index> [--text]')
120
+ print('Usage: storetle get <file|url|corpus> <index|title> [--text]\n'
121
+ ' storetle get wiki "Albert Einstein" --text')
121
122
  sys.exit(1)
122
- with _open_reader(args[0]) as r:
123
+
124
+ src, ref = args[0], ' '.join(args[1:])
125
+ if not _is_url(src) and not Path(src).exists():
126
+ # treat as a named corpus from the public registry
127
+ from .registry import resolve
128
+ try:
129
+ src, ref = resolve(src, ref)
130
+ except (KeyError, IndexError) as e:
131
+ print(f'Error: {e}')
132
+ sys.exit(1)
133
+
134
+ with _open_reader(src) as r:
123
135
  try:
124
- idx = int(args[1])
136
+ idx = int(ref)
125
137
  doc = r.get_text(idx) if text else r[idx]
126
138
  sys.stdout.buffer.write(doc)
127
139
  if text:
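Note on the `cmd_get` hunk above: the new branch order is URL first, then existing local path, then registry corpus name, and everything after the first argument is joined into one reference string. The sketch below restates that order in isolation. `looks_like_url` is a hypothetical stand-in for the package's `_is_url` helper (its body is not part of this diff), and `classify_source` is an illustrative name, not a function in storetle.

```python
from pathlib import Path


def looks_like_url(src):
    # Hypothetical stand-in for storetle's _is_url; the real helper is not shown in the diff.
    return src.startswith(('http://', 'https://'))


def classify_source(args):
    """Return (kind, src, ref) using the same precedence as the new cmd_get branch."""
    # All remaining arguments form one reference, so an unquoted
    # multi-word title ends up as a single string.
    src, ref = args[0], ' '.join(args[1:])
    if looks_like_url(src):
        kind = 'url'
    elif Path(src).exists():
        kind = 'file'
    else:
        kind = 'corpus'  # only reached when no local path with that name exists
    return kind, src, ref
```

One consequence of this order, if the sketch matches the real helper's behaviour: a file or directory in the working directory that happens to share a corpus name would take precedence over the registry lookup.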
@@ -131,6 +143,17 @@ def cmd_get(args):
131
143
  sys.exit(1)
132
144
 
133
145
 
146
+ def cmd_corpora(args):
147
+ from .registry import list_corpora
148
+ print()
149
+ for name, info in list_corpora().items():
150
+ print(f' {name:12s} {info.get("title","")} '
151
+ f'[{info.get("docs","?"):,} docs, {info.get("snapshot","")}, '
152
+ f'{info.get("license","")}]')
153
+ print('\n Usage: storetle get <corpus> <title-or-index> [--text]')
154
+ print()
155
+
156
+
134
157
  def cmd_from_warc(args):
135
158
  if len(args) < 2:
136
159
  print('Usage: storetle from-warc <input.warc[.gz]> <output.storetle>')
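Note on `cmd_corpora` above: the f-string applies the `,` format spec to `info.get("docs", "?")`. In Python that spec is only valid for numbers, so a registry entry without a `docs` field would appear to raise `ValueError` when the fallback string `"?"` is formatted. A minimal sketch of a tolerant formatter follows; `format_docs` and `corpus_line` are hypothetical names, and the sample values in the usage line are taken from the README text in this diff rather than from the live registry.

```python
def format_docs(value):
    """Format a document count with thousands separators; pass anything else through."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return f'{value:,}'
    return str(value)


def corpus_line(name, info):
    """Build one listing line without assuming every field is present."""
    return (f'  {name:12s} {info.get("title", "")}  '
            f'[{format_docs(info.get("docs", "?"))} docs, '
            f'{info.get("snapshot", "")}, {info.get("license", "")}]')


print(corpus_line('wiki', {'docs': 267503, 'snapshot': '2025-03-20',
                           'license': 'CC-BY-SA-4.0'}))
```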
@@ -262,6 +285,7 @@ def cmd_warc_decode(args):
262
285
 
263
286
  COMMANDS = {
264
287
  'bench': cmd_bench,
288
+ 'corpora': cmd_corpora,
265
289
  'pack': cmd_pack,
266
290
  'unpack': cmd_unpack,
267
291
  'info': cmd_info,
@@ -276,11 +300,13 @@ COMMANDS = {
276
300
  HELP = """storetle — HTML-aware compression for large document collections
277
301
 
278
302
  Commands:
303
+ corpora List free hosted corpora
279
304
  bench <folder> Benchmark your HTML data vs gzip WARC
280
305
  pack <folder> <output> Compress a folder → .storetle file
281
306
  unpack <file-or-url> <out> [--text] Extract → HTML files (or clean .txt)
282
307
  info <file-or-url> Show file statistics
283
- get <file-or-url> <index> Extract one doc by index over HTTP this
308
+ get <file|url|corpus> <ref> Extract one doc by index or title remote
309
+ reads fetch only the containing ~2MB chunk.
284
310
  fetches only the containing ~2MB chunk.
285
311
  Add --text for tag-stripped plain text
286
312
  from-warc <input.warc[.gz]> <out> Convert WARC → .storetle
@@ -0,0 +1,104 @@
+ # registry.py — named corpora: `storetle get wiki "Albert Einstein"`.
+ #
+ # A corpus registry (corpora.json) lives next to the hosted data, so new
+ # corpora appear without a new package release. Title→location maps are
+ # fetched once and cached under ~/.cache/storetle/.
+
+ import gzip
+ import json
+ import time
+ import urllib.request
+ from pathlib import Path
+
+ REGISTRY_URL = 'https://data.davisbrief.com/corpora.json'
+ CACHE_DIR = Path.home() / '.cache' / 'storetle'
+ _REGISTRY_TTL = 3600
+
+
+ def _fetch(url, timeout=30):
+     req = urllib.request.Request(url, headers={'User-Agent': 'storetle-cli'})
+     with urllib.request.urlopen(req, timeout=timeout) as r:
+         return r.read()
+
+
+ def load_registry():
+     """Fetch corpora.json, with a small on-disk cache."""
+     CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     cache = CACHE_DIR / 'corpora.json'
+     if cache.exists() and time.time() - cache.stat().st_mtime < _REGISTRY_TTL:
+         return json.loads(cache.read_text())
+     try:
+         data = _fetch(REGISTRY_URL)
+         cache.write_bytes(data)
+         return json.loads(data)
+     except Exception:
+         if cache.exists():  # stale beats broken
+             return json.loads(cache.read_text())
+         raise
+
+
+ def _titles_path(corpus_name, entry):
+     """Download (once) and cache the corpus title map."""
+     CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     local = CACHE_DIR / f'titles-{corpus_name}.tsv.gz'
+     if not local.exists():
+         url = entry['base'].rstrip('/') + '/' + entry['titles']
+         print(f'[storetle] fetching title index for "{corpus_name}" '
+               f'({url.rsplit("/",1)[-1]}, one-time)...')
+         local.write_bytes(_fetch(url))
+     return local
+
+
+ def _lookup_title(corpus_name, entry, title):
+     """Resolve a title to (shard_no, doc_idx). Exact, then case-insensitive."""
+     want = title.strip()
+     want_ci = want.lower()
+     ci_hit = None
+     with gzip.open(_titles_path(corpus_name, entry), 'rt') as f:
+         for line in f:
+             name, shard, idx = line.rstrip('\n').rsplit('\t', 2)
+             if name == want:
+                 return int(shard), int(idx)
+             if ci_hit is None and name.lower() == want_ci:
+                 ci_hit = (int(shard), int(idx))
+     if ci_hit:
+         return ci_hit
+     raise KeyError(f'title not found in corpus "{corpus_name}": {title!r}')
+
+
+ def resolve(corpus_name, ref):
+     """Resolve (corpus, index-or-title) → (shard_url, doc_idx).
+
+     ref may be an integer global index or a document title.
+     """
+     reg = load_registry()
+     if corpus_name not in reg:
+         raise KeyError(f'unknown corpus {corpus_name!r}; '
+                        f'available: {", ".join(sorted(reg))}')
+     entry = reg[corpus_name]
+     base = entry['base'].rstrip('/')
+     shards = entry['shards']
+
+     try:
+         gidx = int(ref)
+     except (TypeError, ValueError):
+         shard_no, idx = _lookup_title(corpus_name, entry, ref)
+         return f'{base}/{shards[shard_no]}', idx
+
+     # integer: map global index onto shards via per-shard doc counts
+     counts = entry.get('shard_docs') or []
+     if not counts:
+         return f'{base}/{shards[0]}', gidx
+     run = 0
+     for shard_no, n in enumerate(counts):
+         if gidx < run + n:
+             return f'{base}/{shards[shard_no]}', gidx - run
+         run += n
+     raise IndexError(f'index {gidx} out of range ({run} docs in corpus)')
+
+
+ def list_corpora():
+     reg = load_registry()
+     return {name: {k: v for k, v in e.items() if k in
+                    ('title', 'docs', 'snapshot', 'license')}
+             for name, e in sorted(reg.items())}
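Note on `resolve()` above: for an integer reference, the global index is mapped to a shard by walking the per-shard document counts and subtracting the running total. The sketch below isolates that arithmetic. `locate` is a hypothetical name and the counts in the usage line are made up for illustration. It also adds a guard for negative indices, which is an assumption of this sketch and not something the diff does: as written in the diff, a negative index would appear to satisfy the first comparison and be passed through to the first shard unchanged.

```python
def locate(gidx, shard_docs):
    """Map a global document index to (shard_no, local_idx) using per-shard counts."""
    if gidx < 0:
        # Guard added in this sketch only; the diff has no equivalent check.
        raise IndexError(f'index {gidx} is negative')
    run = 0  # documents in all shards before the current one
    for shard_no, n in enumerate(shard_docs):
        if gidx < run + n:
            return shard_no, gidx - run
        run += n
    raise IndexError(f'index {gidx} out of range ({run} docs in corpus)')


# Made-up counts: shards of 3, 2 and 4 documents.
print(locate(3, [3, 2, 4]))
```

Because `resolve()` tries `int(ref)` before the title lookup, a title consisting only of digits would be read as a global index rather than looked up by name.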
@@ -84,6 +84,18 @@ def _encode_doc(html):
84
84
  return struct.pack('>I', len(ss)) + ss + cs
85
85
 
86
86
 
87
+ def _encode_text_doc(text):
88
+ """Encode a plain-text document as a single text node.
89
+
90
+ Same container, same decoders: HTML readers see the text (escaped),
91
+ --text / get_text() return it verbatim. This is how text-mode corpora
92
+ (e.g. clean-text Wikipedia) are stored without any format change."""
93
+ if isinstance(text, bytes):
94
+ text = text.decode('utf-8', errors='replace')
95
+ ss, cs = _build_streams_class_split([(T_TEXT, None, text)])
96
+ return struct.pack('>I', len(ss)) + ss + cs
97
+
98
+
87
99
  def _decode_doc(raw):
88
100
  """Decode a v2 blob back to reconstructed HTML bytes."""
89
101
  from .decoder import _Stream
@@ -229,6 +241,23 @@ class StreamWriter:
229
241
  else:
230
242
  self._append_sync(html)
231
243
 
244
+ def append_text(self, text):
245
+ """Encode and buffer one plain-text document (no HTML parsing)."""
246
+ if self._workers > 1:
247
+ # preserve document order: settle in-flight HTML encodes first
248
+ self._drain_all()
249
+ if isinstance(text, str):
250
+ data = text.encode('utf-8', errors='replace')
251
+ else:
252
+ data = text
253
+ self._total_orig += len(data)
254
+ raw = _encode_text_doc(data)
255
+ self._chunk_buf.append(raw)
256
+ self._chunk_bytes += len(raw)
257
+ self._total_docs += 1
258
+ if len(self._chunk_buf) >= CHUNK_DOCS or self._chunk_bytes >= CHUNK_BYTES:
259
+ self._flush_chunk()
260
+
232
261
  def _append_sync(self, html):
233
262
  if isinstance(html, str):
234
263
  try:
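Note on `append_text` above: each encoded document is buffered, and the chunk is flushed as soon as either the document count or the byte total reaches its limit. The sketch below shows that two-threshold pattern on its own. `ChunkBuffer` is a hypothetical class, and the limits passed in the usage lines are arbitrary, since the values of `CHUNK_DOCS` and `CHUNK_BYTES` are not shown in this diff.

```python
class ChunkBuffer:
    """Collect encoded documents and flush them in chunks (illustrative only)."""

    def __init__(self, max_docs, max_bytes):
        self.max_docs = max_docs
        self.max_bytes = max_bytes
        self.pending = []        # documents waiting for the next flush
        self.pending_bytes = 0   # total size of the pending documents
        self.flushed = []        # completed chunks, each a list of documents

    def add(self, raw):
        self.pending.append(raw)
        self.pending_bytes += len(raw)
        # Either limit triggers a flush, whichever is reached first.
        if len(self.pending) >= self.max_docs or self.pending_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []
            self.pending_bytes = 0


buf = ChunkBuffer(max_docs=3, max_bytes=10)
for doc in (b'aa', b'bb', b'cc', b'x' * 10, b'y'):
    buf.add(doc)
print(len(buf.flushed), buf.pending)
```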
@@ -15,7 +15,7 @@ import struct
15
15
  from .decoder import _Stream
16
16
  from .encoder import (T_OPEN, T_CLOSE, T_TEXT, T_DOCTYPE,
17
17
  T_COMMENT, T_SELFCLOSE, T_RAWTEXT)
18
- from .vocab import ID_TO_TAG, SHARED_STRINGS, UNKNOWN_ID
18
+ from .vocab import ID_TO_TAG, ID_TO_ATTR, SHARED_STRINGS, UNKNOWN_ID
19
19
 
20
20
  _BLOCK_TAGS = frozenset((
21
21
  'p', 'div', 'br', 'li', 'ul', 'ol', 'dl', 'dt', 'dd',
@@ -29,6 +29,25 @@ _CELL_TAGS = frozenset(('td', 'th'))
29
29
  _collapse_blank = re.compile(r'\n\s*\n+')
30
30
  _collapse_space = re.compile(r'[ \t\f\v]+')
31
31
 
32
+ # Elements whose entire subtree is navigation/boilerplate, not content.
33
+ # Matched against class tokens (substring) and role attribute values.
34
+ _SKIP_CLASS_SUBSTR = ('navbox', 'catlinks', 'mw-jump', 'printfooter',
35
+ 'mw-editsection', 'breadcrumb', 'site-nav')
36
+ _SKIP_ROLES = ('navigation',)
37
+
38
+
39
+ def _is_boilerplate(attrs):
40
+ for name, value in attrs:
41
+ if not value:
42
+ continue
43
+ if name == 'role' and value.lower() in _SKIP_ROLES:
44
+ return True
45
+ if name == 'class':
46
+ v = value.lower()
47
+ if any(s in v for s in _SKIP_CLASS_SUBSTR):
48
+ return True
49
+ return False
50
+
32
51
 
33
52
  def decode_text(raw: bytes) -> bytes:
34
53
  """Extract plain text from one encoded document (the blob stored in a
@@ -39,6 +58,8 @@ def decode_text(raw: bytes) -> bytes:
39
58
  ss_data_len = ss_len
40
59
 
41
60
  out = []
61
+ depth = 0
62
+ skip_above = None # while set, drop text until depth returns here
42
63
 
43
64
  def boundary(tag):
44
65
  if tag in _BLOCK_TAGS:
@@ -54,19 +75,27 @@ def decode_text(raw: bytes) -> bytes:
54
75
  tag = cs.read_string(SHARED_STRINGS) if tag_id == UNKNOWN_ID \
55
76
  else ID_TO_TAG.get(tag_id, '')
56
77
  ac = ss.read_byte()
78
+ attrs = []
57
79
  for _ in range(ac):
58
80
  aid = ss.read_byte()
59
- if aid == UNKNOWN_ID:
60
- cs.read_string(SHARED_STRINGS) # attr name — discard
61
- cs.read_string(SHARED_STRINGS) # attr value — discard
62
- boundary(tag)
81
+ aname = cs.read_string(SHARED_STRINGS) if aid == UNKNOWN_ID \
82
+ else ID_TO_ATTR.get(aid, '')
83
+ attrs.append((aname, cs.read_string(SHARED_STRINGS)))
84
+ if nt == T_OPEN:
85
+ depth += 1
86
+ if skip_above is None and _is_boilerplate(attrs):
87
+ skip_above = depth - 1
88
+ if skip_above is None:
89
+ boundary(tag)
63
90
 
64
91
  elif nt == T_CLOSE:
65
- pass # no payload; block boundary handled at open
92
+ depth = max(0, depth - 1)
93
+ if skip_above is not None and depth <= skip_above:
94
+ skip_above = None
66
95
 
67
96
  elif nt == T_TEXT:
68
97
  t = cs.read_string(SHARED_STRINGS)
69
- if t:
98
+ if t and skip_above is None:
70
99
  out.append(t)
71
100
 
72
101
  elif nt in (T_RAWTEXT, T_DOCTYPE, T_COMMENT):
@@ -74,5 +103,5 @@ def decode_text(raw: bytes) -> bytes:
74
103
 
75
104
  text = ''.join(out)
76
105
  text = _collapse_space.sub(' ', text)
77
- text = _collapse_blank.sub('\n', text)
106
+ text = _collapse_blank.sub('\n\n', text) # keep paragraph boundaries
78
107
  return text.strip().encode('utf-8')
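Note on the `decode_text` hunks above: boilerplate subtrees are dropped by recording the depth just outside the matching element in `skip_above` and suppressing text until a close brings the depth back to that level. The sketch below replays that logic over a plain list of tokens instead of storetle's binary streams. The token tuples, `visible_text`, and the shortened skip list are assumptions made for illustration.

```python
SKIP_CLASS_SUBSTR = ('navbox', 'catlinks')  # shortened list for the sketch
SKIP_ROLES = ('navigation',)


def is_boilerplate(attrs):
    """True if a class value contains a skip substring or the role is a skip role."""
    for name, value in attrs:
        if not value:
            continue
        if name == 'role' and value.lower() in SKIP_ROLES:
            return True
        if name == 'class' and any(s in value.lower() for s in SKIP_CLASS_SUBSTR):
            return True
    return False


def visible_text(tokens):
    """Join text tokens, dropping everything inside boilerplate subtrees."""
    out = []
    depth = 0
    skip_above = None  # while set, text is dropped until depth returns here
    for tok in tokens:
        kind = tok[0]
        if kind == 'open':
            depth += 1
            if skip_above is None and is_boilerplate(tok[2]):
                skip_above = depth - 1
        elif kind == 'close':
            depth = max(0, depth - 1)
            if skip_above is not None and depth <= skip_above:
                skip_above = None
        elif kind == 'text' and skip_above is None:
            out.append(tok[1])
    return ''.join(out)


tokens = [
    ('open', 'div', []),
    ('text', 'Body. '),
    ('open', 'div', [('class', 'navbox')]),
    ('text', 'Nav'),
    ('open', 'span', []),
    ('text', 'link'),
    ('close',),
    ('close',),
    ('text', 'More.'),
    ('close',),
]
print(visible_text(tokens))  # → Body. More.
```

Because the class check is a substring match on the whole attribute value, a class such as `navbox-styles` would also be skipped under this rule.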
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: storetle
3
- Version: 0.2.2
3
+ Version: 0.3.1
4
4
  Summary: HTML-aware compression for document corpora — solid-archive ratios with random access
5
5
  Author-email: Davis Brief <davis@team8.co>
6
6
  License: MIT
@@ -98,23 +98,29 @@ storetle train my_corpus/ --output my.bin # domain-specific dictionary
98
98
 
99
99
  ## Hosted corpora — free
100
100
 
101
- **Simple English Wikipedia, complete** — 267,503 articles, 10.06 GB of HTML
102
- in 843 MB, snapshot 2025-03-20, CC-BY-SA-4.0. Six self-contained shards with
103
- JSONL metadata indexes (title doc index) and a SHA-256 manifest:
104
-
105
- ```
106
- https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/manifest.json
101
+ ```bash
102
+ storetle corpora # list what's available
103
+ storetle get wiki "Albert Einstein" --text # one article, by name, ~2s
104
+ storetle get wiki-text "Black hole" # from the clean-text edition
107
105
  ```
108
106
 
109
- Pull one article out of a 100+ MB shard, by index, in ~2 seconds:
107
+ Corpus names resolve through a public registry
108
+ (`https://data.davisbrief.com/corpora.json`) — new corpora appear without a
109
+ package update. Title lookup fetches a small index once and caches it.
110
110
 
111
- ```bash
112
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 # Albert Einstein, full HTML
113
- storetle get https://pub-0a9a18b1320f46f794f8374a71aa608b.r2.dev/simplewiki/simplewiki-20250320-0005.storetle 11244 --text # …as clean plain text
114
- ```
111
+ **Available now — Simple English Wikipedia, complete** (267,503 articles,
112
+ snapshot 2025-03-20, CC-BY-SA-4.0):
113
+
114
+ | edition | size | contents |
115
+ |---|---|---|
116
+ | `wiki` | 843 MB / 6 shards | full article HTML (10.06 GB raw) |
117
+ | `wiki-text` | 196 MB / 1 file | clean plain text, random access |
118
+ | `…jsonl.zst` | 168 MB | `{"title","url","text"}` per line, for ML pipelines |
115
119
 
116
- Find a title's index by grepping the shard's `.index.jsonl`. More corpora
117
- (arXiv, PubMed Central OA) coming.
120
+ All under `https://data.davisbrief.com/simplewiki/` with JSONL metadata
121
+ indexes and a SHA-256 manifest. The entire text of Simple English Wikipedia
122
+ in 196 MB, where any article is one ~2 MB range request away — that's the
123
+ point of the format. More corpora (arXiv, PubMed Central OA) coming.
118
124
 
119
125
  ## Plain text extraction (v0.2.2)
120
126
 
@@ -133,8 +139,8 @@ server-side code, works against any Range-capable host (R2, S3, GitHub
133
139
  Pages, nginx):
134
140
 
135
141
  ```bash
136
- storetle info https://adventurelands.github.io/storetle/sample.storetle
137
- storetle get https://adventurelands.github.io/storetle/sample.storetle 4
142
+ storetle info https://data.davisbrief.com/simplewiki/simplewiki-text-20250320.storetle
143
+ storetle get wiki "Albert Einstein" --text
138
144
  ```
139
145
 
140
146
  ```python
@@ -8,6 +8,7 @@ storetle/cube_dict_v10.bin
8
8
  storetle/decoder.py
9
9
  storetle/encoder.py
10
10
  storetle/folder.py
11
+ storetle/registry.py
11
12
  storetle/remote.py
12
13
  storetle/stream.py
13
14
  storetle/text.py
File without changes