fastemailparser 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Méthode
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,478 @@
1
+ Metadata-Version: 2.4
2
+ Name: fastemailparser
3
+ Version: 0.1.0
4
+ Summary: Very fast email parsing tool, split emails, retrieve headers & signatures
5
+ Home-page: https://github.com/Methode-dev/EmailParser
6
+ Author: Julien Calenge @ Méthode
7
+ Author-email: julien.calenge@methode.dev
8
+ Classifier: Programming Language :: C
9
+ Classifier: License :: OSI Approved :: MIT License
10
+ Classifier: Operating System :: OS Independent
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Dynamic: author
14
+ Dynamic: author-email
15
+ Dynamic: classifier
16
+ Dynamic: description
17
+ Dynamic: description-content-type
18
+ Dynamic: home-page
19
+ Dynamic: license-file
20
+ Dynamic: summary
21
+
22
+ # emailparser
23
+
24
+ A C extension for Python that splits email reply chains into individual
25
+ segments and extracts structured data from each one.
26
+
27
+ This README has been generated with Claude Code.
28
+ The code itself was partially generated with Claude Code, especially the guards and error handling, as I find it highly efficient at that kind of task. The core logic is hand-written, but the modifications Claude made have not been thoroughly reviewed, so they may contain errors.
29
+
30
+ See CONTRIBUTING.md for a deep dive into the code.
31
+
32
+ ---
33
+
34
+ ## Table of contents
35
+
36
+ 0. [Installation](#0-installation)
37
+ 1. [Building](#1-building)
38
+ 2. [Concepts](#2-concepts)
39
+ 3. [The Email iterator](#3-the-email-iterator)
40
+ - [Input formats](#input-formats)
41
+ - [plain_text](#plain_text)
42
+ - [standalone](#standalone)
43
+ - [strip_headers](#strip_headers)
44
+ - [outer_headers](#outer_headers)
45
+ 4. [parse_headers](#4-parse_headers)
46
+ 5. [extract_body](#5-extract_body)
47
+ 6. [find_signature](#6-find_signature)
48
+ 7. [strip_signature](#7-strip_signature)
49
+ 8. [Putting it all together](#8-putting-it-all-together)
50
+ 9. [Source layout](#9-source-layout)
51
+
52
+ ---
53
+ ## 0. Installation
54
+ ```bash
55
+ pip install fastemailparser
56
+ ```
57
+
58
+ ---
59
+ ## 1. Building
60
+
61
+ ```bash
62
+ pip install setuptools
63
+ python3 setup.py build_ext --inplace
64
+ ```
65
+
66
+ This compiles `emailparser.cpython-*.so` into the project root. All tests
67
+ can then be run with:
68
+
69
+ ```bash
70
+ python3 -m pytest functional_tests.py test_emailparser.py -v
71
+ ```
72
+
73
+ ---
74
+
75
+ ## 2. Concepts
76
+
77
+ An email reply chain looks like this on disk (plain text or HTML):
78
+
79
+ ```
80
+ Latest reply body…
81
+
82
+ From: Alice <alice@example.com>
83
+ Sent: Monday, 2 June 2025 10:00 AM
84
+ To: Bob
85
+ Subject: RE: Project update
86
+
87
+ Previous reply body…
88
+
89
+ From: Bob <bob@example.com>
90
+
91
+ ```
92
+
93
+ `emailparser` finds every `From:` / `De:` separator and splits the content
94
+ at each one into **segments**. Every segment after the first begins with its
95
+ separator header block and ends just before the next one.
96
+
97
+ Key facts:
98
+
99
+ - **Segment 0** is the content *before* the first separator — the body of
100
+ the most recent (top) reply. It has no inline headers of its own.
101
+ - **Segments 1, 2, …** each start with their own `From:`/`Sent:`/`To:`/
102
+ `Subject:` block, followed by a blank line and then the reply body.
103
+ - For raw MIME emails (`.eml`, `Date: …` at the top of the file), the outer
104
+ header block is skipped automatically and exposed via `outer_headers`.
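The splitting rule described above can be sketched in pure Python. This is only an illustration for plain-text input: the `SEPARATOR` pattern and the `split_segments` helper are assumptions made for this example, not the regex or the function the C extension actually uses.

```python
import re

# Assumed separator: a line that starts with "From:" or "De :".
# The library's real SEPARATOR_REGEX may differ.
SEPARATOR = re.compile(r"^(?:From|De) ?:", re.MULTILINE)


def split_segments(text: str) -> list[str]:
    """Split a plain-text reply chain at every separator line."""
    starts = [m.start() for m in SEPARATOR.finditer(text)]
    bounds = [0] + starts + [len(text)]
    # Segment 0 is everything before the first separator.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


chain = (
    "Latest reply body\n"
    "\n"
    "From: Alice <alice@example.com>\n"
    "Sent: Monday, 2 June 2025 10:00 AM\n"
    "To: Bob\n"
    "Subject: RE: Project update\n"
    "\n"
    "Previous reply body\n"
    "\n"
    "From: Bob <bob@example.com>\n"
)

segments = split_segments(chain)
for i, seg in enumerate(segments):
    print(i, repr(seg.splitlines()[0]))
```

Joining the segments back together reproduces the input, which makes the sketch easy to sanity-check against real chains.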
105
+
106
+ ---
107
+
108
+ ## 3. The Email iterator
109
+
110
+ ```python
111
+ import emailparser
112
+
113
+ for segment in emailparser.Email("path/to/email.html"):
114
+     print(segment)
115
+ ```
116
+
117
+ `Email` accepts a file path, a raw string, or bytes. It is an iterator:
118
+ calling `next()` on it returns one segment at a time.
119
+
120
+ ### Input formats
121
+
122
+ | Input | Behaviour |
123
+ |---|---|
124
+ | `Email("path/to/file.html")` | Opens and reads the file |
125
+ | `Email("<html>…</html>")` | String with no matching file → used as raw content |
126
+ | `Email(b"raw bytes")` | Always treated as raw content |
127
+
128
+ ```python
129
+ # File path
130
+ for seg in emailparser.Email("mail.txt"):
131
+     ...
132
+
133
+ # Raw string
134
+ content = open("mail.txt").read()
135
+ for seg in emailparser.Email(content):
136
+     ...
137
+
138
+ # Bytes
139
+ raw = open("mail.txt", "rb").read()
140
+ for seg in emailparser.Email(raw):
141
+     ...
142
+ ```
143
+
144
+ All three produce identical results.
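The dispatch in the table above can be approximated as follows. This is a hedged sketch: `load_source` is a hypothetical helper, and both the `os.path.isfile` check and the UTF-8 encoding of raw strings are assumptions about how such a rule could be implemented, not a description of the extension's internals.

```python
import os


def load_source(source) -> bytes:
    """Return raw email content for a path, a raw string, or bytes."""
    if isinstance(source, bytes):
        return source  # bytes are always treated as raw content
    if os.path.isfile(source):
        with open(source, "rb") as fh:  # the string names an existing file
            return fh.read()
    return source.encode("utf-8")  # no matching file: use the string itself


print(load_source(b"raw bytes"))
print(load_source("<html>no such file</html>"))
```

One consequence of this kind of rule is that a raw string which happens to match an existing file name is read as a path, so passing bytes is the unambiguous way to supply content.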
145
+
146
+ ---
147
+
148
+ ### plain_text
149
+
150
+ ```python
151
+ emailparser.Email(source, plain_text=True)
152
+ ```
153
+
154
+ Strips HTML tags and decodes entities (`&lt;`, `&nbsp;`, …) via libxml2
155
+ before yielding each segment. Block-level elements (`<p>`, `<div>`, `<br>`,
156
+ …) are replaced with newlines.
157
+
158
+ ```python
159
+ segs = list(emailparser.Email("mail.txt", plain_text=True))
160
+
161
+ # Without plain_text:
162
+ # '<div data-test-id="mailMessageBodyContainer">Dear Ms. De Pedro…'
163
+
164
+ # With plain_text:
165
+ # '\n\n\nDear Ms. De Pedro,\nGood day,\n…'
166
+ ```
167
+
168
+ Use `plain_text=True` whenever you want to process the text content rather
169
+ than render the HTML.
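To make the transformation concrete, here is a rough stand-in built on the standard library's `html.parser` instead of libxml2. The `html_to_text` helper and the `BLOCK_TAGS` subset are assumptions for illustration only, and the exact whitespace produced by the extension may differ.

```python
from html.parser import HTMLParser

# Illustrative subset of block-level tags that become newlines.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Drop tags, decode entities, and turn block elements into newlines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<div>Dear Ms. De Pedro,<br>Good day,</div><p>a &lt; b</p>"
print(repr(html_to_text(sample)))
```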
170
+
171
+ ---
172
+
173
+ ### standalone
174
+
175
+ ```python
176
+ emailparser.Email(source, standalone=True)
177
+ ```
178
+
179
+ Wraps each segment in a complete, self-contained HTML document so it can be
180
+ saved to a file and opened directly in a browser:
181
+
182
+ ```html
183
+ <!DOCTYPE html>
184
+ <html><head>
185
+ <meta charset="UTF-8">
186
+ <style>/* base CSS + all <style> blocks extracted from the source */</style>
187
+ </head>
188
+ <body>
189
+ <!-- segment content -->
190
+ </body></html>
191
+ ```
192
+
193
+ - HTML segments are embedded as-is.
194
+ - Plain-text / quoted-printable segments are decoded and wrapped in `<pre>`.
195
+ - Ignored when combined with `plain_text=True`.
196
+
197
+ ```python
198
+ segs = list(emailparser.Email("test_emails/test2.html", standalone=True))
199
+
200
+ with open("segment_0.html", "w") as f:
201
+     f.write(segs[0])  # open in browser and it renders correctly
202
+ ```
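The wrapping step can be pictured with a small pure-Python sketch. `wrap_standalone_sketch` and its `css` and `is_html` parameters are hypothetical names chosen for this example; the extension extracts the CSS and detects the segment type itself.

```python
import html


def wrap_standalone_sketch(segment: str, css: str = "", is_html: bool = True) -> str:
    """Wrap one segment in a minimal self-contained HTML document."""
    # Plain-text segments are escaped and wrapped in <pre>; HTML is embedded as-is.
    body = segment if is_html else f"<pre>{html.escape(segment)}</pre>"
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta charset="UTF-8">\n'
        f"<style>{css}</style>\n"
        "</head>\n"
        "<body>\n"
        f"{body}\n"
        "</body></html>\n"
    )


print(wrap_standalone_sketch("x < y", is_html=False))
```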
203
+
204
+ ---
205
+
206
+ ### strip_headers
207
+
208
+ ```python
209
+ emailparser.Email(source, plain_text=True, strip_headers=True)
210
+ ```
211
+
212
+ Removes the `From:`/`To:`/`Subject:`/`Date:` header block from the top of
213
+ each segment, leaving only the reply body.
214
+
215
+ - Has **no effect** on segment 0 (which starts with the reply body directly,
216
+ not with a header block).
217
+ - Works for both English headers (`From:`, `Sent:`) and French headers
218
+ (`De :`, `Envoyé :`, `À :`).
219
+
220
+ ```python
221
+ # Without strip_headers:
222
+ # '\nFrom: D9A (Branko Olic)…\nSent: Wednesday…\nTo: docs\n\nHi Abby,…'
223
+
224
+ # With strip_headers:
225
+ # '\nHi Abby,…'
226
+ ```
227
+
228
+ `strip_headers` is typically combined with `plain_text`:
229
+
230
+ ```python
231
+ for body in emailparser.Email("chain.html", plain_text=True, strip_headers=True):
232
+     print(body)
233
+ ```
234
+
235
+ ---
236
+
237
+ ### outer_headers
238
+
239
+ For raw MIME emails the outer header block (the metadata of the most recent
240
+ email) is skipped during iteration but remains accessible as a property:
241
+
242
+ ```python
243
+ email = emailparser.Email("test_emails/test2.html")
244
+
245
+ print(email.outer_headers)
246
+ # {
247
+ # "from": '"D9A (Branko Olic) Marlow CD-D9A" <d9a@marlowgroup.com>',
248
+ # "to": ['docs <docs@interportfrance.fr>'],
249
+ # "cc": ['"g2.mnph@marlowgroup.com" <g2.mnph@marlowgroup.com>', …],
250
+ # "bcc": [],
251
+ # "subject": "MV RANGER - Schengen Visa - OS BALABAT",
252
+ # "date": "Wed, 11 Jun 2025 13:25:14 +0200"
253
+ # }
254
+ ```
255
+
256
+ Returns `None` for pure HTML emails (e.g. `mail.txt`) that have no outer
257
+ MIME header block.
258
+
259
+ ```python
260
+ if email.outer_headers:
261
+     sender = email.outer_headers["from"]
262
+ ```
263
+
264
+ ---
265
+
266
+ ## 4. parse_headers
267
+
268
+ ```python
269
+ emailparser.parse_headers(segment) -> dict
270
+ ```
271
+
272
+ Extracts the header fields from any segment string.
273
+
274
+ **Returns** a dict with these keys — always present, defaults shown:
275
+
276
+ | Key | Type | Default |
277
+ |---|---|---|
278
+ | `"from"` | `str \| None` | `None` |
279
+ | `"to"` | `list[str]` | `[]` |
280
+ | `"cc"` | `list[str]` | `[]` |
281
+ | `"bcc"` | `list[str]` | `[]` |
282
+ | `"subject"` | `str \| None` | `None` |
283
+ | `"date"` | `str \| None` | `None` |
284
+
285
+ Recognised field names (case-insensitive):
286
+
287
+ | Language | Fields |
288
+ |---|---|
289
+ | English | `From`, `To`, `CC`, `BCC`, `Subject`, `Date`, `Sent`, `Reply-To` |
290
+ | French | `De`, `À` / `à`, `Cci`, `Objet`, `Envoyé` |
291
+
292
+ Handles HTML segments and quoted-printable encoding automatically.
293
+
294
+ ```python
295
+ segs = list(emailparser.Email("test_emails/test2.html", plain_text=True))
296
+ seg = segs[3] # a quoted reply starting with "From: D9A…"
297
+
298
+ h = emailparser.parse_headers(seg)
299
+ # {
300
+ # "from": "D9A (Branko Olic) Marlow CD-D9A",
301
+ # "to": ["docs"],
302
+ # "cc": ["g2.mnph@marlowgroup.com", "Info - INTERPORT", …],
303
+ # "bcc": [],
304
+ # "subject": "RE: MV RANGER - Schengen Visa - OS BALABAT",
305
+ # "date": "Wednesday, June 11, 2025 11:49 AM"
306
+ # }
307
+
308
+ print(h["from"]) # "D9A (Branko Olic) Marlow CD-D9A"
309
+ print(h["to"]) # ["docs"]
310
+ print(h["subject"]) # "RE: MV RANGER - Schengen Visa - OS BALABAT"
311
+ ```
312
+
313
+ > **Note:** `parse_headers` on segment 0 returns all `None`/`[]` because
314
+ > segment 0 is the latest reply body with no inline header block.
315
+ > Use `email.outer_headers` to access the metadata of that first email.
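For readers who want to see the shape of the result without building the extension, the following pure-Python sketch mimics the documented return value on plain-text segments. `parse_headers_sketch` is a hypothetical helper; splitting recipients on `;` and ignoring `Reply-To` are assumptions of this example, and HTML or quoted-printable input is not handled.

```python
import re

# Aliases from the field table above, lower-cased; Sent/Envoyé map to "date".
ALIASES = {
    "from": "from", "de": "from",
    "to": "to", "\u00e0": "to",            # "à"
    "cc": "cc",
    "bcc": "bcc", "cci": "bcc",
    "subject": "subject", "objet": "subject",
    "date": "date", "sent": "date", "envoy\u00e9": "date",  # "envoyé"
}
LIST_KEYS = {"to", "cc", "bcc"}
FIELD = re.compile(r"^\s*([^\W\d_][\w-]*)\s*:\s*(.*)$")


def parse_headers_sketch(segment: str) -> dict:
    """Parse the leading header block of a plain-text segment."""
    result = {"from": None, "to": [], "cc": [], "bcc": [],
              "subject": None, "date": None}
    for line in segment.lstrip().splitlines():
        if not line.strip():
            break  # the first blank line ends the header block
        match = FIELD.match(line)
        key = ALIASES.get(match.group(1).lower()) if match else None
        if key is None:
            continue  # not a recognised field
        value = match.group(2).strip()
        if key in LIST_KEYS:
            # Assumption: recipients are separated by ";"
            result[key] = [v.strip() for v in value.split(";") if v.strip()]
        else:
            result[key] = value
    return result


seg = (
    "From: Alice <alice@example.com>\n"
    "Sent: Monday, 2 June 2025 10:00 AM\n"
    "To: Bob; Carol\n"
    "Subject: RE: Project update\n"
    "\n"
    "Previous reply body\n"
)
print(parse_headers_sketch(seg))
```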
316
+
317
+ ---
318
+
319
+ ## 5. extract_body
320
+
321
+ ```python
322
+ emailparser.extract_body(segment) -> str
323
+ ```
324
+
325
+ Returns the segment with the header block stripped — symmetric counterpart
326
+ to `parse_headers`.
327
+
328
+ - If the segment starts with a recognised header field, scans to the first
329
+ blank line and returns everything after it.
330
+ - If the segment does **not** start with a recognised header (e.g. segment 0),
331
+ it is returned **unchanged**.
332
+
333
+ ```python
334
+ seg = segs[3] # starts with "From: D9A…\nSent:…\nTo:…\n\nHi Abby,…"
335
+ body = emailparser.extract_body(seg)
336
+ # '\nHi Abby,\n\nGm,\n\nPls see blw…'
337
+
338
+ # Segment 0 has no header block — returned as-is
339
+ body_0 = emailparser.extract_body(segs[0])
340
+ # '\n\nDear Ms. De Pedro,\nGood day,…'
341
+ ```
342
+
343
+ Pair it with `parse_headers` to access both parts independently:
344
+
345
+ ```python
346
+ headers = emailparser.parse_headers(seg)
347
+ body = emailparser.extract_body(seg)
348
+ ```
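The two rules above (strip up to the first blank line, or return the input unchanged) can be sketched in a few lines. `extract_body_sketch` and `HEADER_START` are hypothetical names; returning an empty string when no blank line follows the headers is an assumption, as is the handling of `\n` line endings only.

```python
import re

# Field names from the parse_headers table; \u00c0 is "À", \u00e9 is "é".
HEADER_START = re.compile(
    r"^\s*(?:From|De|Sent|Envoy\u00e9|To|\u00c0|Subject|Objet|Date|Cc|Bcc|Cci)\s*:",
    re.IGNORECASE,
)


def extract_body_sketch(segment: str) -> str:
    """Return the segment without its leading header block."""
    match = HEADER_START.match(segment)
    if match is None:
        return segment  # e.g. segment 0: no header block to strip
    blank = segment.find("\n\n", match.end())
    if blank == -1:
        return ""  # assumed behaviour: headers only, no body
    # Keep the newline of the blank line, as in the documented example.
    return segment[blank + 1:]


seg = "\nFrom: D9A\nSent: Wednesday\nTo: docs\n\nHi Abby,\n"
print(repr(extract_body_sketch(seg)))
```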
349
+
350
+ ---
351
+
352
+ ## 6. find_signature
353
+
354
+ ```python
355
+ emailparser.find_signature(text) -> int
356
+ ```
357
+
358
+ Returns the **character index** where the signature block starts, or `-1`
359
+ if no signature is found.
360
+
361
+ Detects:
362
+
363
+ | Pattern | Example |
364
+ |---|---|
365
+ | RFC 3676 delimiter | `--` or `-- ` on its own line |
366
+ | English closings | `Kind regards`, `Best regards`, `Sincerely`, `Thanks`, … |
367
+ | French closings | `Cordialement`, `Bien cordialement`, `Merci`, `Salutations` |
368
+
369
+ Works on both plain-text and raw HTML segments (libxml2 DOM path for HTML,
370
+ line-scan fallback for plain text).
371
+
372
+ ```python
373
+ body = emailparser.extract_body(segs[3])
374
+
375
+ idx = emailparser.find_signature(body)
376
+ # 801
377
+
378
+ if idx >= 0:
379
+     message = body[:idx]  # reply body without signature
380
+     signature = body[idx:]  # "Kind regards,\n\nBranko Olic…"
381
+ ```
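The plain-text line scan can be illustrated with a simplified sketch. `find_signature_sketch`, `strip_signature_sketch`, and the `CLOSINGS` tuple are assumptions for this example: only the closings named in the table are listed, a closing must sit alone on its line, and the HTML DOM path is not covered.

```python
# Closings named in the table above, lower-cased (the real list is longer).
CLOSINGS = (
    "kind regards", "best regards", "sincerely", "thanks",
    "cordialement", "bien cordialement", "merci", "salutations",
)


def find_signature_sketch(text: str) -> int:
    """Return the character index where the signature starts, or -1."""
    offset = 0
    for line in text.splitlines(keepends=True):
        bare = line.rstrip("\r\n")
        closing = line.strip().lower().rstrip(",.!")
        if bare in ("--", "-- ") or closing in CLOSINGS:
            return offset
        offset += len(line)
    return -1


def strip_signature_sketch(text: str) -> str:
    idx = find_signature_sketch(text)
    return text[:idx] if idx >= 0 else text


body = "Hi Abby,\n\nPls see blw.\n\nKind regards,\n\nBranko Olic\n"
idx = find_signature_sketch(body)
print(idx, repr(body[idx:]))
```

Requiring the closing to be the whole line keeps sentences such as "Thanks for the update." from being mistaken for a signature, at the cost of missing closings that share a line with other text.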
382
+
383
+ ---
384
+
385
+ ## 7. strip_signature
386
+
387
+ ```python
388
+ emailparser.strip_signature(text) -> str
389
+ ```
390
+
391
+ Returns `text` with the signature block removed. Shorthand for:
392
+
393
+ ```python
394
+ idx = emailparser.find_signature(text)
395
+ clean = text[:idx] if idx >= 0 else text
396
+ ```
397
+
398
+ Returns the input unchanged if no signature is found.
399
+
400
+ ```python
401
+ body = emailparser.extract_body(segs[3])
402
+ clean = emailparser.strip_signature(body)
403
+ # reply body text only, no "Kind regards,\n\nBranko Olic…"
404
+ ```
405
+
406
+ ---
407
+
408
+ ## 8. Putting it all together
409
+
410
+ Complete example — iterate over a MIME email chain and extract every piece
411
+ of structured data:
412
+
413
+ ```python
414
+ import emailparser
415
+
416
+ source = "test_emails/test2.html"
417
+ email = emailparser.Email(source, plain_text=True)
418
+
419
+ # ── Most recent email (segment 0 has no inline headers) ──────────────────
420
+ first_headers = email.outer_headers # dict or None
421
+ first_body = None # collected below
422
+
423
+ # ── Iterate over all segments ─────────────────────────────────────────────
424
+ for i, seg in enumerate(email):
425
+     headers = emailparser.parse_headers(seg)
426
+     body = emailparser.extract_body(seg)
427
+     clean = emailparser.strip_signature(body)
428
+     sig_idx = emailparser.find_signature(body)
429
+     sig = body[sig_idx:] if sig_idx >= 0 else ""
430
+
431
+     if i == 0:
432
+         first_body = body  # segment 0 is already just the body
433
+
434
+     print(f"── Segment {i} ──────────────────────")
435
+     if i == 0:
436
+         # segment 0: headers come from outer_headers
437
+         if first_headers:
438
+             print(f" From: {first_headers['from']}")
439
+             print(f" Subject: {first_headers['subject']}")
440
+     else:
441
+         print(f" From: {headers['from']}")
442
+         print(f" Date: {headers['date']}")
443
+         print(f" Subject: {headers['subject']}")
444
+     print(f" Body: {clean[:60].strip()!r}…")
445
+     if sig:
446
+         print(f" Sig: {sig[:40].strip()!r}…")
447
+ ```
448
+
449
+ ---
450
+
451
+ ## 9. Source layout
452
+
453
+ ```
454
+ emailparser.c Python type (EmailObject) and module init
455
+ email.h email_t struct definition
456
+ setup.py build script — compiles all src/*.c files
457
+ main.c minimal standalone C binary
458
+ src/
459
+ buf.h strbuf_t type and sb_push (header-only)
460
+ mime.h / mime.c decode_qp · skip_mime_headers · has_html_mime_part
461
+ html.h / html.c walk_text · segment_to_text · html_to_plain_c
462
+ standalone.h / .c extract_css · wrap_standalone
463
+ email_iter.h / .c SEPARATOR_REGEX · new_email · get_next_val
464
+ headers.h / .c canonical_key · py_parse_headers
465
+ body.h / .c find_body_start · py_extract_body
466
+ signature.h / .c py_find_signature · py_strip_signature
467
+ test_emailparser.py unittest suite (mail.txt)
468
+ functional_tests.py pytest parametrised suite (test_emails/)
469
+ ```
470
+
471
+ ### Compile-time override
472
+
473
+ The separator regex can be overridden without changing the source:
474
+
475
+ ```bash
476
+ python3 setup.py build_ext --inplace \
477
+ build_ext --define SEPARATOR_REGEX='"(From|De) ?:"'
478
+ ```