pst-search 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 pst-search contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,15 @@
1
+ include README.md
2
+ include LICENSE
3
+ include THIRD_PARTY_LICENSES.md
4
+ include pyproject.toml
5
+
6
+ recursive-include pst_search/web *
7
+ recursive-include pst_search/node *.mjs
8
+ include pst_search/node/package.json
9
+ include pst_search/node/package-lock.json
10
+
11
+ # Keep build output and runtime artifacts out of the sdist.
12
+ prune pst_search/node/node_modules
13
+ global-exclude __pycache__
14
+ global-exclude *.py[co]
15
+ global-exclude .DS_Store
@@ -0,0 +1,296 @@
1
+ Metadata-Version: 2.4
2
+ Name: pst-search
3
+ Version: 0.1.0
4
+ Summary: Local search engine for Outlook PST files. Index once, search instantly, retrieve attachments on demand.
5
+ Author-email: KD5RYN <jlacy8234@gmail.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/KD5RYN/pst-search
8
+ Project-URL: Repository, https://github.com/KD5RYN/pst-search
9
+ Project-URL: Issues, https://github.com/KD5RYN/pst-search/issues
10
+ Keywords: pst,outlook,email,eml,full-text-search,sqlite,fts5,ediscovery,forensics,libpff
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: Environment :: Console
13
+ Classifier: Environment :: Web Environment
14
+ Classifier: Intended Audience :: End Users/Desktop
15
+ Classifier: Intended Audience :: Legal Industry
16
+ Classifier: Intended Audience :: Information Technology
17
+ Classifier: Operating System :: OS Independent
18
+ Classifier: Operating System :: Microsoft :: Windows
19
+ Classifier: Operating System :: POSIX :: Linux
20
+ Classifier: Operating System :: MacOS
21
+ Classifier: Programming Language :: Python :: 3
22
+ Classifier: Programming Language :: Python :: 3.10
23
+ Classifier: Programming Language :: Python :: 3.11
24
+ Classifier: Programming Language :: Python :: 3.12
25
+ Classifier: Programming Language :: Python :: 3.13
26
+ Classifier: Topic :: Communications :: Email
27
+ Classifier: Topic :: Communications :: Email :: Filters
28
+ Classifier: Topic :: Office/Business
29
+ Classifier: Topic :: System :: Archiving
30
+ Classifier: Topic :: Database
31
+ Requires-Python: >=3.10
32
+ Description-Content-Type: text/markdown
33
+ License-File: LICENSE
34
+ Requires-Dist: fastapi>=0.110
35
+ Requires-Dist: uvicorn[standard]>=0.27
36
+ Requires-Dist: click>=8.1
37
+ Provides-Extra: dev
38
+ Requires-Dist: pytest>=8; extra == "dev"
39
+ Requires-Dist: build>=1.0; extra == "dev"
40
+ Requires-Dist: twine>=5.0; extra == "dev"
41
+ Dynamic: license-file
42
+
43
+ # PST Search
44
+
45
+ A local search engine for Outlook PST files. Index once, then search by subject, body, sender, recipients, folder, or date — and pull attachments directly from the source PST on demand. Built around SQLite FTS5 for instant full-text search.
46
+
47
+ ![PST Search UI — folder tree, search results, message detail](docs/screenshot.png)
48
+
49
+ ## What it's for
50
+
51
+ `pst-search` is for anyone who has one or more Outlook `.pst` files and needs to actually search through them — without installing Outlook, without uploading the archive to a cloud service, and without writing throwaway scripts.
52
+
53
+ Common situations it solves:
54
+
55
+ - **Old mailbox archives.** Email from a previous job, a retired account, or a long-running personal mailbox you've exported. The PST is sitting on a drive somewhere and you want to find things in it.
56
+ - **Recovery and lookup.** Someone hands you a `.pst` and asks "is there an email about X?" or "find everything from Bob in 2024." You point the tool at it, and the tool tells you.
57
+ - **Forwarding specific messages.** Pull a single old email out as a standard `.eml` file (with all its attachments) and drop it into any mail client to forward, archive, or attach to a ticket.
58
+ - **Forensic, discovery, or compliance work.** Structured search across folders, senders, recipients, attachments, and date ranges. Multi-PST library so you can index a stack of archives and search them together.
59
+ - **Privacy-conscious search.** Everything runs on `127.0.0.1`. No data leaves the machine, no account required, the PST file is never uploaded anywhere.
60
+
61
+ ### Why a new tool?
62
+
63
+ Most Python tooling for PSTs sits on top of `libpff`, which has a long-standing unfixed parsing bug (`libpff_table_read: invalid table - missing data identifier`) that makes it unable to read certain real-world PSTs — particularly those exported by recent versions of Outlook. We hit that on a real 8 GB mailbox where libpff couldn't read a single message. `pst-search` routes through an independent codebase (`pst-extractor`, a Node.js port of `java-libpst`), so PSTs that defeat libpff still open here.
64
+
65
+ ## Features
66
+
67
+ - **Full-text search** across subject, body, sender, recipients, and folder path. FTS5-ranked, with `<mark>`-highlighted snippets.
68
+ - **Gmail-style operators** in the search box: `from:bob`, `to:alice`, `subject:budget`, `body:meeting`, `folder:inbox`, combined with `AND`/`OR`/`NOT`, quoted phrases, prefix matching (`meet*`), and parentheses. Click the **?** next to the search box for the full cheatsheet.
69
+ - **Browse mode** — leave the search box empty to list messages newest-first; click any folder in the tree to filter to it.
70
+ - **Sort by date or relevance** — dropdown in the result-list header switches between Newest first (default), Oldest first, and Relevance (BM25 ranking for search queries).
71
+ - **Filters**: from, to, folder, date range, has-attachments.
72
+ - **Lazy attachments**: the index stores only filenames and sizes. Clicking an attachment re-opens the PST and extracts that one file on demand. No multi-GB attachment dump on disk.
73
+ - **Export to `.eml`**: every message has a Download button that produces a standard RFC 5322 `.eml` file with headers, body (plain + HTML), and all attachments. Opens in Outlook, Thunderbird, Apple Mail, or any webmail client that accepts `.eml` uploads.
74
+ - **Multiple PSTs** in one index. Re-indexing a PST replaces its rows in place.
75
+ - **Local-only**: everything runs on `127.0.0.1`. No data leaves your machine.
76
+
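The `.eml` export described above can be approximated with Python's standard `email` package. This is a minimal sketch, not the project's actual export code: the function name `build_eml` and its parameters are assumptions made for illustration, and real messages would carry more headers (date, message ID, Cc).

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_eml(subject, sender, recipients, text_body, html_body=None, attachments=()):
    """Hypothetical helper: assemble one message as RFC 5322 bytes."""
    msg = EmailMessage(policy=policy.SMTP)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(text_body)  # text/plain part
    if html_body is not None:
        # Turns the message into multipart/alternative (plain + HTML).
        msg.add_alternative(html_body, subtype="html")
    for filename, data in attachments:
        # Turns the message into multipart/mixed and appends the file.
        msg.add_attachment(
            data, maintype="application", subtype="octet-stream", filename=filename
        )
    return msg.as_bytes()


eml = build_eml(
    "Q4 plan",
    "Bob <bob@example.com>",
    ["alice@example.com"],
    "See attached.",
    html_body="<p>See <b>attached</b>.</p>",
    attachments=[("plan.txt", b"draft numbers")],
)

# Round-trip check: parse the bytes back and list the attachment names.
parsed = message_from_bytes(eml, policy=policy.default)
attachment_names = [p.get_filename() for p in parsed.walk() if p.get_filename()]
```

Writing `eml` to a file with a `.eml` extension should give something a desktop mail client can open.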
77
+ ## Requirements
78
+
79
+ `pst-search` needs **two runtimes** on the machine before it can do anything useful:
80
+
81
+ - **Python 3.10 or newer** — the indexer, search API, and CLI.
82
+ - **Node.js 18 or newer** — the PST parser (`pst-extractor`) runs as a Node subprocess. `npm` ships with Node.
83
+
84
+ | | Windows | macOS | Linux (Ubuntu/Debian) |
85
+ | --- | --- | --- | --- |
86
+ | Python 3.10+ | `winget install Python.Python.3.12` | `brew install python@3.12` | `sudo apt install python3 python3-pip python3-tk python3-venv` |
87
+ | Node.js 18+ | `winget install OpenJS.NodeJS.LTS` | `brew install node` | `sudo apt install nodejs npm` |
88
+
89
+ > **macOS Homebrew users**: also run `brew install python-tk@3.12` so the file picker dialog works. (The python.org installer includes it already.)
90
+
91
+ ## Quick start
92
+
93
+ ### Install from PyPI
94
+
95
+ ```bash
96
+ pip install pst-search
97
+ pstsearch setup # one-time: pulls down the Node-side pst-extractor library
98
+ pstsearch serve
99
+ ```
100
+
101
+ `pstsearch setup` is a thin wrapper around `npm install` for the bundled Node helper. If you skip it, the first indexing run will install the dependencies for you automatically.
102
+
103
+ ### …or install from source
104
+
105
+ ```bash
106
+ # If you have git installed:
107
+ git clone https://github.com/KD5RYN/pst-search
108
+ cd pst-search
109
+ pip install -e .
110
+ (cd pst_search/node && npm install)
111
+ pstsearch serve
112
+ ```
113
+
114
+ …or **download as a ZIP** from <https://github.com/KD5RYN/pst-search> (green **Code** button → **Download ZIP**), then unzip, `cd` into the folder, and run the same two install commands.
115
+
116
+ On Windows PowerShell the second install line is:
117
+
118
+ ```pwsh
119
+ cd pst_search\node; npm install; cd ..\..
120
+ ```
121
+
122
+ ### Run
123
+
124
+ ```bash
125
+ pstsearch serve
126
+ ```
127
+
128
+ A browser tab opens at <http://127.0.0.1:8765>.
129
+
130
+ 1. Click **📁 Manage PSTs → + Add another PST**
131
+ 2. Pick your `.pst` file in the native dialog
132
+ 3. **Adjust indexing options** (or accept defaults) and click **Start indexing**
133
+ 4. Search as soon as the first batch lands; the rest streams in behind you
134
+
135
+ Once indexing finishes, the whole archive is searchable.
136
+
137
+ The search index lives in your per-user data directory:
138
+
139
+ - Windows: `%APPDATA%\pst-search\index.db`
140
+ - macOS: `~/Library/Application Support/pst-search/index.db`
141
+ - Linux: `$XDG_DATA_HOME/pst-search/index.db` (default `~/.local/share/pst-search/index.db`)
142
+
143
+ Delete that file to wipe the index and start over.
144
+
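For scripts that need to locate the index, the per-platform locations listed above can be resolved roughly as follows. This is a sketch under stated assumptions: `default_index_path` is a hypothetical helper, not a function exported by `pst-search`, and it ignores any `--db PATH` override.

```python
import os
import sys
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath


def default_index_path(platform: str = sys.platform, env=None, home=None) -> PurePath:
    """Hypothetical helper mirroring the documented index.db locations."""
    env = os.environ if env is None else env
    if platform.startswith("win"):
        # Windows: %APPDATA%\pst-search\index.db
        return PureWindowsPath(env["APPDATA"]) / "pst-search" / "index.db"
    home_dir = PurePosixPath(home if home is not None else Path.home().as_posix())
    if platform == "darwin":
        # macOS: ~/Library/Application Support/pst-search/index.db
        base = home_dir / "Library" / "Application Support"
    else:
        # Linux: $XDG_DATA_HOME, falling back to ~/.local/share
        xdg = env.get("XDG_DATA_HOME")
        base = PurePosixPath(xdg) if xdg else home_dir / ".local" / "share"
    return base / "pst-search" / "index.db"


print(default_index_path())
```

The `platform`, `env`, and `home` parameters exist only so the lookup can be exercised for another OS without changing the real environment.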
145
+ ## Search syntax
146
+
147
+ The search box accepts the same operators most users already know from Gmail and Outlook, plus all of SQLite FTS5's native query language.
148
+
149
+ **Operators:**
150
+
151
+ | Type | Means |
152
+ | --- | --- |
153
+ | `from:bob` | sender name or email contains "bob" |
154
+ | `to:alice` | any recipient (To/Cc/Bcc) contains "alice" |
155
+ | `subject:budget` | match restricted to the subject |
156
+ | `body:meeting` | match restricted to the body |
157
+ | `folder:inbox` | folder path contains "inbox" |
158
+ | `cc:` / `bcc:` | any recipient, same as `to:` (To, Cc, and Bcc are not distinguished) |
159
+
160
+ **Combining:**
161
+
162
+ | Form | Means |
163
+ | --- | --- |
164
+ | `a b` | both words present (implicit AND) |
165
+ | `a AND b` | both — explicit |
166
+ | `a OR b` | either |
167
+ | `a NOT b` | a but not b |
168
+ | `"q4 plan"` | exact phrase |
169
+ | `meet*` | prefix — matches meeting, meetup, meets, … |
170
+ | `(a OR b) AND c` | group with parens |
171
+
172
+ **Example:** `from:bob AND subject:budget NOT folder:trash` — emails from Bob about budgets that aren't in any trash folder.
173
+
174
+ Click the **?** icon at the right edge of the search box for a popup version of this cheatsheet.
175
+
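To make the operator table concrete, here is a small self-contained sketch of how Gmail-style prefixes could be mapped onto FTS5 column filters. The table layout, the column names (`sender`, `recipients`), and the `translate` helper are assumptions for illustration; the real schema and translation live in `pst_search/db.py` and may differ (for example, "contains" matching may not be token-based). It requires a Python build whose SQLite includes FTS5.

```python
import re
import sqlite3

# Hypothetical mapping from user-facing operators to FTS5 column names.
ALIASES = {"from": "sender", "to": "recipients", "cc": "recipients", "bcc": "recipients"}


def translate(query: str) -> str:
    """Rewrite from:/to:/cc:/bcc: prefixes into FTS5 column filters."""
    return re.sub(r"\b(from|to|cc|bcc):", lambda m: ALIASES[m.group(1)] + ":", query)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE messages USING fts5(subject, body, sender, recipients, folder)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
    [
        ("Q4 budget review", "Please review the attached budget.",
         "Bob Smith bob@example.com", "alice@example.com", "Inbox"),
        ("Old budget draft", "Superseded numbers.",
         "Bob Smith bob@example.com", "alice@example.com", "Deleted Items/Trash"),
        ("Lunch", "Meeting at noon?",
         "Alice Jones alice@example.com", "bob@example.com", "Inbox"),
    ],
)


def search(query: str) -> list[str]:
    """Return matching subjects, best BM25 rank first."""
    rows = conn.execute(
        "SELECT subject FROM messages WHERE messages MATCH ? ORDER BY rank",
        (translate(query),),
    )
    return [row[0] for row in rows]


print(search("from:bob AND subject:budget NOT folder:trash"))
print(search("meet*"))
```

In FTS5, `NOT` binds more tightly than `AND`, so the first query reads as `sender:bob AND (subject:budget NOT folder:trash)`.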
176
+ ## Indexing options
177
+
178
+ The "Add a PST" dialog and the `pstsearch index` CLI command both expose the same three knobs. Defaults work for almost every mailbox; tweak them only when the defaults don't fit your data.
179
+
180
+ | Option | GUI label | CLI flag | Default | When to change |
181
+ | --- | --- | --- | --- | --- |
182
+ | Include message bodies | _Index message bodies_ (checkbox) | `--no-body` | on | Off for **huge archives** when you only need to search by subject/sender — indexing becomes dramatically faster. |
183
+ | Max body length kept | _Max body length per message_ | `--body-cap KB` | 32 KB | Raise (up to 1024 KB) if your real-content emails routinely run longer; lower to shrink the index. |
184
+ | Skip body for very large messages | _Skip bodies larger than_ | `--max-html-fetch MB` | 4 MB | Lower if you want to ignore giant newsletter-style mail; raise toward 100 MB if you specifically want body text from huge messages too. |
185
+
186
+ Open the **Advanced options** disclosure in the Add-PST dialog to see and adjust the last two.
187
+
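The interaction of the three options can be summarized in a few lines of Python. This is only a sketch of the documented behavior: `body_to_index` is a hypothetical function, and it assumes the cap is applied to UTF-8 bytes, which the README does not specify.

```python
def body_to_index(
    body: str,
    message_size_bytes: int,
    *,
    index_body: bool = True,     # off when --no-body is passed
    body_cap_kb: int = 32,       # --body-cap KB
    max_html_fetch_mb: int = 4,  # --max-html-fetch MB
) -> str:
    """Return the body text that would be stored in the index."""
    if not index_body:
        return ""
    if message_size_bytes > max_html_fetch_mb * 1024 * 1024:
        # Very large message: headers are still indexed, the body is skipped.
        return ""
    cap = body_cap_kb * 1024
    encoded = body.encode("utf-8")
    if len(encoded) <= cap:
        return body
    # Truncate on a byte budget; drop a trailing partial character if any.
    return encoded[:cap].decode("utf-8", errors="ignore")


print(len(body_to_index("x" * 40_000, message_size_bytes=50_000)))
```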
188
+ ## App settings (⚙️ button)
189
+
190
+ Click the gear icon in the header to see what the server is currently doing:
191
+
192
+ - **Listening at** — the URL the server is bound to
193
+ - **Network access** — confirms whether you're local-only or exposed
194
+ - **Index database** — where the SQLite file lives, with an **Open data folder** button
195
+
196
+ These are read-only because changing them requires restarting the server. To change them, pass flags to `pstsearch serve` (see below).
197
+
198
+ ## Commands
199
+
200
+ ```
201
+ pstsearch serve [--host HOST] [--port PORT] [--db PATH] [--no-browser]
202
+ Launch the web UI. Defaults: --host 127.0.0.1 --port 8765.
203
+ Pass --host 0.0.0.0 to expose to your LAN (DO NOT do this on an
204
+ untrusted network — anyone reaching the port can search your mail).
205
+
206
+ pstsearch index FILE.pst
207
+ [--no-body]
208
+ [--body-cap KB]
209
+ [--max-html-fetch MB]
210
+ [--db PATH]
211
+ Index a PST from the command line. Re-running on the same file
212
+ replaces its rows. Options mirror the GUI Add-PST dialog.
213
+
214
+ pstsearch list
215
+ Show indexed PSTs (id, message count, path, indexed-at).
216
+
217
+ pstsearch setup
218
+ One-time install of the Node-side dependencies (pst-extractor and friends).
219
+ Safe to re-run. Indexing will auto-bootstrap these on first use if you
220
+ forget, so this command is mostly for users who want the install to
221
+ happen up front rather than the first time they hit "Index".
222
+ ```
223
+
224
+ ## Architecture
225
+
226
+ ```
227
+ PST file --[Node + pst-extractor]--NDJSON--> Python indexer --[SQLite + FTS5]--> Search API --[HTML/JS]--> Browser
228
+ |
229
+ (on attachment click,
230
+ spawn Node, extract
231
+ one attachment by
232
+ descriptor node ID)
233
+ ```
234
+
235
+ | Layer | File | Purpose |
236
+ | --- | --- | --- |
237
+ | PST extractor | `pst_search/node/extract.mjs` | Walks the PST with `pst-extractor`, streams one NDJSON record per message to stdout. |
238
+ | Attachment extractor | `pst_search/node/attachment.mjs` | Pulls a single attachment's bytes from a PST by descriptor node ID. |
239
+ | Message dump | `pst_search/node/message.mjs` | Full message export (headers + both body forms + every attachment) for `.eml` building. |
240
+ | Python driver | `pst_search/pst.py` | Spawns Node, parses NDJSON, exposes Python iterators, attachment fetch, and full-message export. |
241
+ | Indexer | `pst_search/indexer.py` | Consumes the message stream and bulk-inserts into SQLite. |
242
+ | Indexing jobs | `pst_search/jobs.py` | Background indexing thread + job registry. Lets the web UI fire off a scan and poll for progress. |
243
+ | Database | `pst_search/db.py` | Schema + FTS5 virtual table + search/browse queries + Gmail-style operator translation. |
244
+ | Server | `pst_search/server.py` | FastAPI endpoints — see below. |
245
+ | Web UI | `pst_search/web/index.html` | Single-file frontend (HTML + inline CSS + JS), no build step. |
246
+ | CLI | `pst_search/cli.py` | `index` / `serve` / `list` / `setup` entry points. |
247
+
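The Node-to-Python handoff in the diagram is a line-oriented stream: one JSON object per line on the child's stdout. A minimal sketch of that consumer pattern follows. It is not the code in `pst_search/pst.py`; the function name is an assumption, and a Python one-liner stands in for the Node extractor so the example runs without Node installed.

```python
import json
import subprocess
import sys


def iter_ndjson(cmd):
    """Spawn `cmd` and yield one parsed JSON record per stdout line."""
    with subprocess.Popen(
        cmd, stdout=subprocess.PIPE, text=True, encoding="utf-8"
    ) as proc:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    # Leaving the `with` block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"extractor exited with status {proc.returncode}")


# Stand-in for the real extractor. An actual call might look like
# ["node", "pst_search/node/extract.mjs", "archive.pst"], but that argument
# convention is an assumption, not something the README documents.
demo_cmd = [
    sys.executable,
    "-c",
    "import json\nfor s in ('Budget', 'Lunch'):\n    print(json.dumps({'subject': s}))",
]

for record in iter_ndjson(demo_cmd):
    print(record["subject"])
```

Because records are yielded as they arrive, an indexer built this way can insert the first batch while the extractor is still walking the rest of the file.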
248
+ **HTTP API:**
249
+
250
+ | Endpoint | Method | Purpose |
251
+ | --- | --- | --- |
252
+ | `/api/search` | GET | FTS5 search with filters + sort. |
253
+ | `/api/folders` | GET | Distinct folder paths and message counts (for the tree). |
254
+ | `/api/psts` | GET | List of indexed PSTs. |
255
+ | `/api/psts/{pst_id}` | DELETE | Remove a PST from the index. |
256
+ | `/api/pick-pst` | POST | Open a native OS file picker dialog and return the chosen path. |
257
+ | `/api/index` | POST | Start a background indexing job. Body: `{path, options?}`. |
258
+ | `/api/jobs` / `/api/jobs/{id}` | GET | Job progress polling. |
259
+ | `/api/settings` | GET | Runtime config (host, port, db path, local-only flag). |
260
+ | `/api/open-data-folder` | POST | Open the index DB folder in the OS file manager. |
261
+ | `/api/message/{id}` | GET | Message metadata + attachment list. |
262
+ | `/api/message/{id}/export.eml` | GET | Download the message as a standard `.eml` file. |
263
+ | `/api/attachment/{msg}/{idx}` | GET | Stream one attachment's bytes from the source PST. |
264
+
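As a rough illustration of scripting against these endpoints, the helpers below build URLs for the two download routes and save a response to disk. Only the paths come from the table above; the helper names are hypothetical, and treating the message ID as an integer-like value is an assumption. Query parameters for `/api/search` are not documented here, so they are left out.

```python
import shutil
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8765"  # default host and port of `pstsearch serve`


def export_url(message_id, base: str = BASE_URL) -> str:
    """URL of the .eml export for one message."""
    return f"{base}/api/message/{quote(str(message_id))}/export.eml"


def attachment_url(message_id, index, base: str = BASE_URL) -> str:
    """URL that streams one attachment of a message."""
    return f"{base}/api/attachment/{quote(str(message_id))}/{quote(str(index))}"


def download(url: str, dest: str) -> None:
    """Save the response body to `dest` (needs a running server)."""
    with urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


print(export_url(42))
print(attachment_url(42, 0))
```

Each download spawns a Node process on the server side, so a loop over many messages will be slow.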
265
+ ## Performance notes
266
+
267
+ - Indexing throughput is ~35 messages/sec end-to-end on a typical desktop. An 8 GB / 27K-message PST takes ~13 minutes with default options.
268
+ - Default body cap is 32 KB per message (roughly 5,000+ words), well past the length of normal correspondence. Marketing emails with hundreds of KB of HTML are truncated, but the useful content (greeting, offer, call-to-action) is usually in the first few KB. Tune in the Add-PST dialog or via `--body-cap KB`.
269
+ - By default, messages larger than 4 MB total skip body extraction entirely (subject/sender/recipients/folder still indexed). On a typical mailbox this affects well under 1% of messages. Tune via `--max-html-fetch MB`.
270
+ - Skipping body extraction altogether (`--no-body` or unchecking _Index message bodies_) makes indexing dramatically faster for huge archives where only header-level search matters.
271
+ - Recipients are parsed from `transportMessageHeaders` rather than `pst-extractor`'s `getRecipient()` API, which hits disk per recipient and dominates indexing time on big PSTs (measured 120 ms/message vs effectively free for header parsing).
272
+ - Attachment downloads and `.eml` export each spawn a fresh Node process (~100–300 ms latency per click). Fine for one-off use; not built for batch export. The attachment bytes are never stored in the index — they're streamed straight from the PST on demand.
273
+
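The header-parsing approach to recipients mentioned above can be illustrated with the standard library. This sketch shows the idea in Python only; in the project the parsing presumably happens on the Node side, and `recipients_from_headers` is a hypothetical name.

```python
from email.parser import HeaderParser
from email.utils import getaddresses

RAW_HEADERS = (
    "From: Bob <bob@example.com>\n"
    "To: Alice <alice@example.com>, carol@example.com\n"
    "Cc: Dave <dave@example.com>\n"
    "Subject: Budget\n"
    "\n"
)


def recipients_from_headers(raw_headers: str) -> list[tuple[str, str]]:
    """Collect (display name, address) pairs from To, Cc, and Bcc."""
    msg = HeaderParser().parsestr(raw_headers)
    fields = msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    return getaddresses(fields)


print(recipients_from_headers(RAW_HEADERS))
```

One pass over a header string that is already in memory avoids a per-recipient lookup, which is the cost the note above attributes to `getRecipient()`.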
274
+ ## License
275
+
276
+ `pst-search` is MIT-licensed (see `LICENSE`). Third-party dependencies and their licenses are listed in `THIRD_PARTY_LICENSES.md`.
277
+
278
+ ## A note on "password-protected" PST files
279
+
280
+ Outlook lets you set a password on a PST. Despite the name, **this is not encryption of the message content** — it's a hash stored in the PST header that Outlook checks before opening the file. The actual messages and attachments are stored as cleartext (or with a weak public byte-permutation cipher that every PST library handles transparently).
281
+
282
+ This means:
283
+
284
+ - `pst-search` reads password-protected PSTs without asking for a password, because the underlying parser (pst-extractor) doesn't honor the header check. This matches the default behavior of essentially every PST tool — libpff, libpst, SysTools, Aspose, and the rest.
285
+ - This is true of the format itself, not specific to our tool. Microsoft documented this in `[MS-PST]`. Anyone with the file can read its contents regardless of the password.
286
+ - If you need the contents of a PST to remain confidential, **rely on file-system encryption** (BitLocker, FileVault, LUKS, an encrypted disk image) rather than Outlook's PST password.
287
+ - Individual messages encrypted via S/MIME are a different mechanism (per-message PKCS#7, requires the recipient's private key) and `pst-search` cannot decrypt those. Their bodies will appear as encrypted blobs in the search index, which is correct behavior.
288
+
289
+ ## Known limitations
290
+
291
+ - **Internal search folders are skipped.** Some PSTs contain auto-generated "search root" folders (`SPAM Search Folder 2`, `ItemProcSearch`, `PST Conversation Lookup`, etc.) that hold search caches rather than user mail. `pst-extractor` can't reliably enumerate them, so we skip them explicitly. No real mail is missed.
292
+ - **No incremental indexing.** Re-running `index` on the same PST replaces all its rows. Fine for static archives; not designed for live mailboxes where the source file keeps changing.
293
+ - **The source PST must stay where you indexed it.** We store the absolute path in the database and need to re-open the file for attachment downloads and `.eml` export. If you move or rename the `.pst`, those operations return a clear error and you'll need to re-index.
294
+ - **S/MIME-encrypted messages are not decrypted.** Per-message PKCS#7 encryption requires the recipient's private key — out of scope for this tool. Such messages appear in the index with encrypted-looking body content. Their headers (subject, sender, date) are still searchable.
295
+ - **HTML body is converted to plain text in the search index.** The detail pane shows the stripped text. The original HTML is preserved when you export the message as `.eml`, but the in-app body view is text only. Tradeoff for compact storage and reliable search.
296
+
@@ -0,0 +1,254 @@
1
+ # PST Search
2
+
3
+ A local search engine for Outlook PST files. Index once, then search by subject, body, sender, recipients, folder, or date — and pull attachments directly from the source PST on demand. Built around SQLite FTS5 for instant full-text search.
4
+
5
+ ![PST Search UI — folder tree, search results, message detail](docs/screenshot.png)
6
+
7
+ ## What it's for
8
+
9
+ `pst-search` is for anyone who has one or more Outlook `.pst` files and needs to actually search through them — without installing Outlook, without uploading the archive to a cloud service, and without writing throwaway scripts.
10
+
11
+ Common situations it solves:
12
+
13
+ - **Old mailbox archives.** Email from a previous job, a retired account, or a long-running personal mailbox you've exported. The PST is sitting on a drive somewhere and you want to find things in it.
14
+ - **Recovery and lookup.** Someone hands you a `.pst` and asks "is there an email about X?" or "find everything from Bob in 2024." You point the tool at it, and the tool tells you.
15
+ - **Forwarding specific messages.** Pull a single old email out as a standard `.eml` file (with all its attachments) and drop it into any mail client to forward, archive, or attach to a ticket.
16
+ - **Forensic, discovery, or compliance work.** Structured search across folders, senders, recipients, attachments, and date ranges. Multi-PST library so you can index a stack of archives and search them together.
17
+ - **Privacy-conscious search.** Everything runs on `127.0.0.1`. No data leaves the machine, no account required, the PST file is never uploaded anywhere.
18
+
19
+ ### Why a new tool?
20
+
21
+ Most Python tooling for PSTs sits on top of `libpff`, which has a long-standing unfixed parsing bug (`libpff_table_read: invalid table - missing data identifier`) that makes it unable to read certain real-world PSTs — particularly those exported by recent versions of Outlook. We hit that on a real 8 GB mailbox where libpff couldn't read a single message. `pst-search` routes through an independent codebase (`pst-extractor`, a Node.js port of `java-libpst`), so PSTs that defeat libpff still open here.
22
+
23
+ ## Features
24
+
25
+ - **Full-text search** across subject, body, sender, recipients, and folder path. FTS5-ranked, with `<mark>`-highlighted snippets.
26
+ - **Gmail-style operators** in the search box: `from:bob`, `to:alice`, `subject:budget`, `body:meeting`, `folder:inbox`, combined with `AND`/`OR`/`NOT`, quoted phrases, prefix matching (`meet*`), and parentheses. Click the **?** next to the search box for the full cheatsheet.
27
+ - **Browse mode** — leave the search box empty to list messages newest-first; click any folder in the tree to filter to it.
28
+ - **Sort by date or relevance** — dropdown in the result-list header switches between Newest first (default), Oldest first, and Relevance (BM25 ranking for search queries).
29
+ - **Filters**: from, to, folder, date range, has-attachments.
30
+ - **Lazy attachments**: the index stores only filenames and sizes. Clicking an attachment re-opens the PST and extracts that one file on demand. No multi-GB attachment dump on disk.
31
+ - **Export to `.eml`**: every message has a Download button that produces a standard RFC 5322 `.eml` file with headers, body (plain + HTML), and all attachments. Opens in Outlook, Thunderbird, Apple Mail, or any webmail client that accepts `.eml` uploads.
32
+ - **Multiple PSTs** in one index. Re-indexing a PST replaces its rows in place.
33
+ - **Local-only**: everything runs on `127.0.0.1`. No data leaves your machine.
34
+
35
+ ## Requirements
36
+
37
+ `pst-search` needs **two runtimes** on the machine before it can do anything useful:
38
+
39
+ - **Python 3.10 or newer** — the indexer, search API, and CLI.
40
+ - **Node.js 18 or newer** — the PST parser (`pst-extractor`) runs as a Node subprocess. `npm` ships with Node.
41
+
42
+ | | Windows | macOS | Linux (Ubuntu/Debian) |
43
+ | --- | --- | --- | --- |
44
+ | Python 3.10+ | `winget install Python.Python.3.12` | `brew install python@3.12` | `sudo apt install python3 python3-pip python3-tk python3-venv` |
45
+ | Node.js 18+ | `winget install OpenJS.NodeJS.LTS` | `brew install node` | `sudo apt install nodejs npm` |
46
+
47
+ > **macOS Homebrew users**: also run `brew install python-tk@3.12` so the file picker dialog works. (The python.org installer includes it already.)
48
+
49
+ ## Quick start
50
+
51
+ ### Install from PyPI
52
+
53
+ ```bash
54
+ pip install pst-search
55
+ pstsearch setup # one-time: pulls down the Node-side pst-extractor library
56
+ pstsearch serve
57
+ ```
58
+
59
+ `pstsearch setup` is a thin wrapper around `npm install` for the bundled Node helper. If you skip it, the first indexing run will install the dependencies for you automatically.
60
+
61
+ ### …or install from source
62
+
63
+ ```bash
64
+ # If you have git installed:
65
+ git clone https://github.com/KD5RYN/pst-search
66
+ cd pst-search
67
+ pip install -e .
68
+ (cd pst_search/node && npm install)
69
+ pstsearch serve
70
+ ```
71
+
72
+ …or **download as a ZIP** from <https://github.com/KD5RYN/pst-search> (green **Code** button → **Download ZIP**), then unzip, `cd` into the folder, and run the same two install commands.
73
+
74
+ On Windows PowerShell the second install line is:
75
+
76
+ ```pwsh
77
+ cd pst_search\node; npm install; cd ..\..
78
+ ```
79
+
80
+ ### Run
81
+
82
+ ```bash
83
+ pstsearch serve
84
+ ```
85
+
86
+ A browser tab opens at <http://127.0.0.1:8765>.
87
+
88
+ 1. Click **📁 Manage PSTs → + Add another PST**
89
+ 2. Pick your `.pst` file in the native dialog
90
+ 3. **Adjust indexing options** (or accept defaults) and click **Start indexing**
91
+ 4. Search as soon as the first batch lands; the rest streams in behind you
92
+
93
+ Once indexing finishes, the whole archive is searchable.
94
+
95
+ The search index lives in your per-user data directory:
96
+
97
+ - Windows: `%APPDATA%\pst-search\index.db`
98
+ - macOS: `~/Library/Application Support/pst-search/index.db`
99
+ - Linux: `$XDG_DATA_HOME/pst-search/index.db` (default `~/.local/share/pst-search/index.db`)
100
+
101
+ Delete that file to wipe the index and start over.
102
+
103
+ ## Search syntax
104
+
105
+ The search box accepts the same operators most users already know from Gmail and Outlook, plus all of SQLite FTS5's native query language.
106
+
107
+ **Operators:**
108
+
109
+ | Type | Means |
110
+ | --- | --- |
111
+ | `from:bob` | sender name or email contains "bob" |
112
+ | `to:alice` | any recipient (To/Cc/Bcc) contains "alice" |
113
+ | `subject:budget` | match restricted to the subject |
114
+ | `body:meeting` | match restricted to the body |
115
+ | `folder:inbox` | folder path contains "inbox" |
116
+ | `cc:` / `bcc:` | any recipient, same as `to:` (To, Cc, and Bcc are not distinguished) |
117
+
118
+ **Combining:**
119
+
120
+ | Form | Means |
121
+ | --- | --- |
122
+ | `a b` | both words present (implicit AND) |
123
+ | `a AND b` | both — explicit |
124
+ | `a OR b` | either |
125
+ | `a NOT b` | a but not b |
126
+ | `"q4 plan"` | exact phrase |
127
+ | `meet*` | prefix — matches meeting, meetup, meets, … |
128
+ | `(a OR b) AND c` | group with parens |
129
+
130
+ **Example:** `from:bob AND subject:budget NOT folder:trash` — emails from Bob about budgets that aren't in any trash folder.
131
+
132
+ Click the **?** icon at the right edge of the search box for a popup version of this cheatsheet.
133
+
134
+ ## Indexing options
135
+
136
+ The "Add a PST" dialog and the `pstsearch index` CLI command both expose the same three knobs. Defaults work for almost every mailbox; tweak them only when the defaults don't fit your data.
137
+
138
+ | Option | GUI label | CLI flag | Default | When to change |
139
+ | --- | --- | --- | --- | --- |
140
+ | Include message bodies | _Index message bodies_ (checkbox) | `--no-body` | on | Off for **huge archives** when you only need to search by subject/sender — indexing becomes dramatically faster. |
141
+ | Max body length kept | _Max body length per message_ | `--body-cap KB` | 32 KB | Raise (up to 1024 KB) if your real-content emails routinely run longer; lower to shrink the index. |
142
+ | Skip body for very large messages | _Skip bodies larger than_ | `--max-html-fetch MB` | 4 MB | Lower if you want to ignore giant newsletter-style mail; raise toward 100 MB if you specifically want body text from huge messages too. |
143
+
144
+ Open the **Advanced options** disclosure in the Add-PST dialog to see and adjust the last two.
145
+
146
+ ## App settings (⚙️ button)
147
+
148
+ Click the gear icon in the header to see what the server is currently doing:
149
+
150
+ - **Listening at** — the URL the server is bound to
151
+ - **Network access** — confirms whether you're local-only or exposed
152
+ - **Index database** — where the SQLite file lives, with an **Open data folder** button
153
+
154
+ These are read-only because changing them requires restarting the server. To change them, pass flags to `pstsearch serve` (see below).
155
+
156
+ ## Commands
157
+
158
+ ```
159
+ pstsearch serve [--host HOST] [--port PORT] [--db PATH] [--no-browser]
160
+ Launch the web UI. Defaults: --host 127.0.0.1 --port 8765.
161
+ Pass --host 0.0.0.0 to expose to your LAN (DO NOT do this on an
162
+ untrusted network — anyone reaching the port can search your mail).
163
+
164
+ pstsearch index FILE.pst
165
+ [--no-body]
166
+ [--body-cap KB]
167
+ [--max-html-fetch MB]
168
+ [--db PATH]
169
+ Index a PST from the command line. Re-running on the same file
170
+ replaces its rows. Options mirror the GUI Add-PST dialog.
171
+
172
+ pstsearch list
173
+ Show indexed PSTs (id, message count, path, indexed-at).
174
+
175
+ pstsearch setup
176
+ One-time install of the Node-side dependencies (pst-extractor and friends).
177
+ Safe to re-run. Indexing will auto-bootstrap these on first use if you
178
+ forget, so this command is mostly for users who want the install to
179
+ happen up front rather than the first time they hit "Index".
180
+ ```
181
+
182
+ ## Architecture
183
+
184
+ ```
185
+ PST file --[Node + pst-extractor]--NDJSON--> Python indexer --[SQLite + FTS5]--> Search API --[HTML/JS]--> Browser
186
+ |
187
+ (on attachment click,
188
+ spawn Node, extract
189
+ one attachment by
190
+ descriptor node ID)
191
+ ```
192
+
193
+ | Layer | File | Purpose |
194
+ | --- | --- | --- |
195
+ | PST extractor | `pst_search/node/extract.mjs` | Walks the PST with `pst-extractor`, streams one NDJSON record per message to stdout. |
196
+ | Attachment extractor | `pst_search/node/attachment.mjs` | Pulls a single attachment's bytes from a PST by descriptor node ID. |
197
+ | Message dump | `pst_search/node/message.mjs` | Full message export (headers + both body forms + every attachment) for `.eml` building. |
198
+ | Python driver | `pst_search/pst.py` | Spawns Node, parses NDJSON, exposes Python iterators, attachment fetch, and full-message export. |
199
+ | Indexer | `pst_search/indexer.py` | Consumes the message stream and bulk-inserts into SQLite. |
200
+ | Indexing jobs | `pst_search/jobs.py` | Background indexing thread + job registry. Lets the web UI fire off a scan and poll for progress. |
201
+ | Database | `pst_search/db.py` | Schema + FTS5 virtual table + search/browse queries + Gmail-style operator translation. |
202
+ | Server | `pst_search/server.py` | FastAPI endpoints — see below. |
203
+ | Web UI | `pst_search/web/index.html` | Single-file frontend (HTML + inline CSS + JS), no build step. |
204
+ | CLI | `pst_search/cli.py` | `serve` / `index` / `list` / `setup` entry points. |
205
+
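For readers unfamiliar with FTS5, the storage layer's core idea can be sketched with the standard-library `sqlite3` module. The table name `messages_fts` and its columns are assumptions made for illustration and are not taken from `pst_search/db.py`; the sketch also assumes the local SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 virtual table with three indexed text columns.
conn.execute("CREATE VIRTUAL TABLE messages_fts USING fts5(subject, sender, body)")
conn.executemany(
    "INSERT INTO messages_fts (subject, sender, body) VALUES (?, ?, ?)",
    [
        ("Quarterly report", "alice@example.com", "Numbers attached for review."),
        ("Lunch on Friday", "bob@example.com", "Does noon work for everyone?"),
    ],
)


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return matching subjects, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT subject FROM messages_fts WHERE messages_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


print(search(conn, "report"))         # matches any column
print(search(conn, "subject:lunch"))  # FTS5 column filter: subject only
```

The `column:term` form is FTS5's native column filter; a Gmail-style operator such as `subject:` would plausibly be translated into a filter of this kind, though the exact mapping used by the project is not shown here.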
206
+ **HTTP API:**
207
+
208
+ | Endpoint | Method | Purpose |
209
+ | --- | --- | --- |
210
+ | `/api/search` | GET | FTS5 search with filters + sort. |
211
+ | `/api/folders` | GET | Distinct folder paths and message counts (for the tree). |
212
+ | `/api/psts` | GET | List of indexed PSTs. |
213
+ | `/api/psts/{pst_id}` | DELETE | Remove a PST from the index. |
214
+ | `/api/pick-pst` | POST | Open a native OS file picker dialog and return the chosen path. |
215
+ | `/api/index` | POST | Start a background indexing job. Body: `{path, options?}`. |
216
+ | `/api/jobs` / `/api/jobs/{id}` | GET | Job progress polling. |
217
+ | `/api/settings` | GET | Runtime config (host, port, db path, local-only flag). |
218
+ | `/api/open-data-folder` | POST | Open the index DB folder in the OS file manager. |
219
+ | `/api/message/{id}` | GET | Message metadata + attachment list. |
220
+ | `/api/message/{id}/export.eml` | GET | Download the message as a standard `.eml` file. |
221
+ | `/api/attachment/{msg}/{idx}` | GET | Stream one attachment's bytes from the source PST. |
222
+
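As a rough sketch of driving the API from a script, the request for `POST /api/index` can be built with the standard library. The endpoint, the `path` key, and the default host/port come from the tables above; sending the body as JSON with a `Content-Type: application/json` header is an assumption, the helper name `build_index_request` is hypothetical, and the response shape is not documented here, so the example does not parse it.

```python
import json
import urllib.request

# Default --host/--port from the Commands section.
BASE_URL = "http://127.0.0.1:8765"


def build_index_request(pst_path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/index request for one PST path."""
    payload = json.dumps({"path": pst_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/index",
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed encoding
        method="POST",
    )


req = build_index_request("/mail/archive.pst")
print(req.get_method(), req.full_url)

# To actually send it (requires `pstsearch serve` to be running):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

After starting a job this way, progress would be polled through `/api/jobs` or `/api/jobs/{id}` as listed in the table.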
223
+ ## Performance notes
224
+
225
+ - Indexing throughput is ~35 messages/sec end-to-end on a typical desktop. An 8 GB / 27K-message PST takes ~13 minutes with default options.
226
+ - Default body cap is 32 KB per message (roughly 5,000+ words), well past the length of normal correspondence. Marketing emails with hundreds of KB of HTML are truncated, but the useful content (greeting, offer, call-to-action) is usually in the first few KB. Tune in the Add-PST dialog or via `--body-cap KB`.
227
+ - By default, messages larger than 4 MB total skip body extraction entirely (subject/sender/recipients/folder still indexed). On a typical mailbox this affects well under 1% of messages. Tune via `--max-html-fetch MB`.
228
+ - Skipping body extraction altogether (`--no-body` or unchecking _Index message bodies_) makes indexing dramatically faster for huge archives where only header-level search matters.
229
+ - Recipients are parsed from `transportMessageHeaders` rather than `pst-extractor`'s `getRecipient()` API, which hits disk per recipient and dominates indexing time on big PSTs (measured 120 ms/message vs effectively free for header parsing).
230
+ - Attachment downloads and `.eml` export each spawn a fresh Node process (~100–300 ms latency per click). Fine for one-off use; not built for batch export. The attachment bytes are never stored in the index — they're streamed straight from the PST on demand.
231
+
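The header-based recipient parsing mentioned in the notes above can be illustrated in Python with the standard `email` package. This is only a sketch of the idea: the function name `recipients_from_headers` is hypothetical, the sample headers are invented, and the project's own extractor may parse headers differently.

```python
from email import message_from_string
from email.utils import getaddresses


def recipients_from_headers(raw_headers: str) -> list[tuple[str, str, str]]:
    """Return (kind, display_name, address) for every To/Cc/Bcc recipient."""
    msg = message_from_string(raw_headers)
    found = []
    for kind in ("To", "Cc", "Bcc"):
        # get_all returns every occurrence of the header, or the default.
        for name, addr in getaddresses(msg.get_all(kind, [])):
            if addr:
                found.append((kind.lower(), name, addr))
    return found


# Invented transport headers, standing in for a message's raw header block.
sample = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>, carol@example.com\n"
    'Cc: "Dave D." <dave@example.com>\n'
    "Subject: hello\n"
    "\n"
)
print(recipients_from_headers(sample))
```

The whole header block is parsed once as a string, so no per-recipient lookups are needed, which is the property the performance note relies on.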
232
+ ## License
233
+
234
+ `pst-search` is MIT-licensed (see `LICENSE`). Third-party dependencies and their licenses are listed in `THIRD_PARTY_LICENSES.md`.
235
+
236
+ ## A note on "password-protected" PST files
237
+
238
+ Outlook lets you set a password on a PST. Despite the name, **this is not encryption of the message content**: it's a CRC-32 hash of the password, stored as a property of the PST's message store, that Outlook checks before opening the file. The actual messages and attachments are stored as cleartext (or with a weak public byte-permutation cipher that every PST library handles transparently).
239
+
240
+ This means:
241
+
242
+ - `pst-search` reads password-protected PSTs without asking for a password, because the underlying parser (pst-extractor) doesn't enforce the stored password check. This matches the default behavior of essentially every PST tool: libpff, libpst, SysTools, Aspose, and the rest.
243
+ - This is true of the format itself, not specific to our tool. Microsoft documented this in `[MS-PST]`. Anyone with the file can read its contents regardless of the password.
244
+ - If you need the contents of a PST to remain confidential, **rely on file-system encryption** (BitLocker, FileVault, LUKS, an encrypted disk image) rather than Outlook's PST password.
245
+ - Individual messages encrypted via S/MIME are a different mechanism (per-message PKCS#7, requires the recipient's private key) and `pst-search` cannot decrypt those. Their bodies will appear as encrypted blobs in the search index, which is correct behavior.
246
+
247
+ ## Known limitations
248
+
249
+ - **Internal search folders are skipped.** Some PSTs contain auto-generated "search root" folders (`SPAM Search Folder 2`, `ItemProcSearch`, `PST Conversation Lookup`, etc.) that hold search caches rather than user mail. `pst-extractor` can't reliably enumerate them and we explicitly skip them. No real mail is missed.
250
+ - **No incremental indexing.** Re-running `index` on the same PST replaces all its rows. Fine for static archives; not designed for live mailboxes where the source file keeps changing.
251
+ - **The source PST must stay where you indexed it.** We store the absolute path in the database and need to re-open the file for attachment downloads and `.eml` export. If you move or rename the `.pst`, those operations return a clear error and you'll need to re-index.
252
+ - **S/MIME-encrypted messages are not decrypted.** Per-message PKCS#7 encryption requires the recipient's private key — out of scope for this tool. Such messages appear in the index with encrypted-looking body content. Their headers (subject, sender, date) are still searchable.
253
+ - **HTML body is converted to plain text in the search index.** The detail pane shows the stripped text. The original HTML is preserved when you export the message as `.eml`, but the in-app body view is text only. Tradeoff for compact storage and reliable search.
254
+
@@ -0,0 +1,45 @@
1
+ # Third-Party Licenses
2
+
3
+ `pst-search` itself is released under the MIT License (see `LICENSE`). It uses
4
+ the following third-party components — all under permissive open-source
5
+ licenses compatible with redistribution.
6
+
7
+ ## Python dependencies (runtime)
8
+
9
+ | Package | License | Source |
10
+ | --- | --- | --- |
11
+ | [FastAPI](https://github.com/fastapi/fastapi) | MIT | <https://github.com/fastapi/fastapi/blob/master/LICENSE> |
12
+ | [Uvicorn](https://github.com/encode/uvicorn) | BSD-3-Clause | <https://github.com/encode/uvicorn/blob/master/LICENSE.md> |
13
+ | [Starlette](https://github.com/encode/starlette) (via FastAPI) | BSD-3-Clause | <https://github.com/encode/starlette/blob/master/LICENSE.md> |
14
+ | [Pydantic](https://github.com/pydantic/pydantic) (via FastAPI) | MIT | <https://github.com/pydantic/pydantic/blob/main/LICENSE> |
15
+ | [Click](https://github.com/pallets/click) | BSD-3-Clause | <https://github.com/pallets/click/blob/main/LICENSE.txt> |
16
+ | [anyio](https://github.com/agronholm/anyio) (via Starlette) | MIT | <https://github.com/agronholm/anyio/blob/master/LICENSE> |
17
+ | [h11](https://github.com/python-hyper/h11) (via Uvicorn) | MIT | <https://github.com/python-hyper/h11/blob/master/LICENSE.txt> |
18
+
19
+ ## Node dependencies (runtime, in `pst_search/node/`)
20
+
21
+ | Package | License | Source |
22
+ | --- | --- | --- |
23
+ | [pst-extractor](https://github.com/epfromer/pst-extractor) | MIT | <https://github.com/epfromer/pst-extractor/blob/master/LICENSE> |
24
+ | [iconv-lite](https://github.com/ashtuchkin/iconv-lite) (via pst-extractor) | MIT | <https://github.com/ashtuchkin/iconv-lite/blob/master/LICENSE> |
25
+ | [long](https://github.com/dcodeIO/long.js) (via pst-extractor) | Apache-2.0 | <https://github.com/dcodeIO/long.js/blob/main/LICENSE> |
26
+ | [uuid-parse](https://github.com/zefferus/uuid-parse) (via pst-extractor) | MIT | <https://github.com/zefferus/uuid-parse/blob/master/LICENSE.md> |
27
+
28
+ `pst-extractor` is itself a TypeScript port of
29
+ [java-libpst](https://github.com/rjohnsondev/java-libpst) (MIT), which
30
+ independently implements Microsoft's PST format from the publicly documented
31
+ [MS-PST specification](https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-pst/).
32
+
33
+ ## Format specification
34
+
35
+ This project reads files written in the Microsoft Personal Storage Table
36
+ (PST) format. The format is publicly documented by Microsoft under the
37
+ [Open Specification Promise](https://learn.microsoft.com/en-us/openspecs/dev_center/ms-devcentlp/051cd324-7081-4f6e-a30c-9c4575c4b921);
38
+ no Microsoft code is used or redistributed.
39
+
40
+ ## License compatibility
41
+
42
+ All runtime dependencies are released under MIT, BSD-3-Clause, or Apache-2.0 —
43
+ permissive licenses fully compatible with redistributing `pst-search` under
44
+ the MIT License. The notices above satisfy the attribution requirements of
45
+ each license; no source-code redistribution obligation exists for any of them.
@@ -0,0 +1 @@
1
+ __version__ = "0.1.0"
@@ -0,0 +1,5 @@
1
+ """Entry point for `python -m pst_search`."""
2
+ from pst_search.cli import main
3
+
4
+ if __name__ == "__main__":
5
+ main()