mail-parser 4.1.3__tar.gz → 4.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- mail_parser-4.2.0/.github/FUNDING.yml +3 -0
- mail_parser-4.2.0/.github/ISSUE_TEMPLATE/bug_report.md +33 -0
- mail_parser-4.2.0/.github/ISSUE_TEMPLATE/feature_request.md +17 -0
- mail_parser-4.2.0/.github/copilot-instructions.md +226 -0
- mail_parser-4.2.0/.github/instructions/containerization-docker-best-practices.instructions.md +681 -0
- mail_parser-4.2.0/.github/instructions/github-actions-ci-cd-best-practices.instructions.md +607 -0
- mail_parser-4.2.0/.github/instructions/markdown.instructions.md +63 -0
- mail_parser-4.2.0/.github/instructions/python.instructions.md +56 -0
- mail_parser-4.2.0/.github/workflows/main.yml +158 -0
- mail_parser-4.2.0/.gitignore +23 -0
- mail_parser-4.2.0/.markdownlint.json +15 -0
- mail_parser-4.2.0/.pre-commit-config.yaml +56 -0
- mail_parser-4.2.0/Dockerfile +29 -0
- mail_parser-4.2.0/Makefile +56 -0
- mail_parser-4.2.0/PKG-INFO +507 -0
- mail_parser-4.2.0/README.md +482 -0
- mail_parser-4.2.0/docker-compose.yml +10 -0
- mail_parser-4.2.0/docs/images/Bitcoin SpamScope.jpg +0 -0
- mail_parser-4.2.0/pyproject.toml +103 -0
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/__init__.py +0 -2
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/__main__.py +2 -4
- mail_parser-4.2.0/src/mailparser/const.py +101 -0
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/core.py +83 -91
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/exceptions.py +0 -1
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/utils.py +132 -118
- {mail_parser-4.1.3 → mail_parser-4.2.0}/src/mailparser/version.py +1 -2
- mail_parser-4.2.0/tests/mails/mail_malformed_1 +1660 -0
- mail_parser-4.2.0/tests/mails/mail_malformed_2 +60 -0
- mail_parser-4.2.0/tests/mails/mail_malformed_3 +345 -0
- mail_parser-4.2.0/tests/mails/mail_outlook_1 +0 -0
- mail_parser-4.2.0/tests/mails/mail_test_1 +858 -0
- mail_parser-4.2.0/tests/mails/mail_test_10 +4186 -0
- mail_parser-4.2.0/tests/mails/mail_test_11 +856 -0
- mail_parser-4.2.0/tests/mails/mail_test_12 +17 -0
- mail_parser-4.2.0/tests/mails/mail_test_13 +1421 -0
- mail_parser-4.2.0/tests/mails/mail_test_14 +33 -0
- mail_parser-4.2.0/tests/mails/mail_test_15 +5684 -0
- mail_parser-4.2.0/tests/mails/mail_test_16 +26 -0
- mail_parser-4.2.0/tests/mails/mail_test_17 +102 -0
- mail_parser-4.2.0/tests/mails/mail_test_18 +14 -0
- mail_parser-4.2.0/tests/mails/mail_test_2 +19588 -0
- mail_parser-4.2.0/tests/mails/mail_test_3 +55 -0
- mail_parser-4.2.0/tests/mails/mail_test_4 +8257 -0
- mail_parser-4.2.0/tests/mails/mail_test_5 +2919 -0
- mail_parser-4.2.0/tests/mails/mail_test_6 +2414 -0
- mail_parser-4.2.0/tests/mails/mail_test_7 +1434 -0
- mail_parser-4.2.0/tests/mails/mail_test_8 +162 -0
- mail_parser-4.2.0/tests/mails/mail_test_9 +68 -0
- mail_parser-4.2.0/tests/test_improved_received_patterns.py +167 -0
- {mail_parser-4.1.3 → mail_parser-4.2.0}/tests/test_mail_parser.py +490 -52
- mail_parser-4.2.0/tests/test_main.py +360 -0
- mail_parser-4.2.0/tests/test_received_corpus.py +307 -0
- mail_parser-4.2.0/tests/test_utils.py +633 -0
- mail_parser-4.2.0/uv.lock +1322 -0
- mail_parser-4.1.3/PKG-INFO +0 -338
- mail_parser-4.1.3/README.md +0 -300
- mail_parser-4.1.3/pyproject.toml +0 -3
- mail_parser-4.1.3/setup.cfg +0 -72
- mail_parser-4.1.3/setup.py +0 -20
- mail_parser-4.1.3/src/mail_parser.egg-info/PKG-INFO +0 -338
- mail_parser-4.1.3/src/mail_parser.egg-info/SOURCES.txt +0 -21
- mail_parser-4.1.3/src/mail_parser.egg-info/dependency_links.txt +0 -1
- mail_parser-4.1.3/src/mail_parser.egg-info/entry_points.txt +0 -2
- mail_parser-4.1.3/src/mail_parser.egg-info/requires.txt +0 -14
- mail_parser-4.1.3/src/mail_parser.egg-info/top_level.txt +0 -1
- mail_parser-4.1.3/src/mailparser/const.py +0 -98
- mail_parser-4.1.3/tests/test_main.py +0 -172
- {mail_parser-4.1.3 → mail_parser-4.2.0}/LICENSE.txt +0 -0
- {mail_parser-4.1.3 → mail_parser-4.2.0}/NOTICE.txt +0 -0
@@ -0,0 +1,33 @@
+---
+name: Bug report
+about: Create a report to help us improve
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+
+1. `import mailparser`
+2. `mail = mailparser.parse_from_file(f)`
+3. '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Raw mail**
+The raw mail to reproduce the behavior.
+You can use a `gist` like [this](https://gist.github.com/fedelemantuano/5dd702004c25a46b2bd60de21e67458e).
+The issues without raw mail will be closed.
+
+**Environment:**
+
+- OS: [e.g. Linux, Windows]
+- Docker: [yes or no]
+- mail-parser version [e.g. 3.6.0]
+
+**Additional context**
+Add any other context about the problem here (e.g. stack traceback error).
@@ -0,0 +1,17 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
@@ -0,0 +1,226 @@
+# Copilot Instructions for mail-parser
+
+mail-parser is a **production-grade email parsing library** for Python that transforms raw email messages into
+structured Python objects. Originally built as the foundation for [SpamScope](https://github.com/SpamScope/spamscope),
+it excels at security analysis, forensics, and RFC-compliant email processing.
+
+## Core Architecture
+
+### Factory-Based API Pattern
+
+**Always use factory functions** instead of direct `MailParser()` instantiation:
+
+```python
+import mailparser
+mail = mailparser.parse_from_file(filepath) # Standard email files
+mail = mailparser.parse_from_string(raw_email) # Email as string
+mail = mailparser.parse_from_bytes(email_bytes) # Email as bytes
+mail = mailparser.parse_from_file_msg(msg_file) # Outlook .msg files
+```
+
+### Triple-Format Property Access
+
+Every parsed component offers **three access patterns** (`src/mailparser/core.py:550-570`):
+
+```python
+mail.subject # Python object (decoded string)
+mail.subject_raw # Raw header value (JSON list)
+mail.subject_json # JSON-serialized version
+```
+
+This pattern applies to all properties via `__getattr__` magic in `core.py`.
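
The suffix-based access described in the added lines above can be sketched with the standard library alone. This is an illustration, not code from the package: `HeaderView` is a hypothetical class, and the way it builds the `_raw` and `_json` values is an assumption based only on the comments in the snippet above.

```python
import json
from email import message_from_string


class HeaderView:
    """Hypothetical wrapper (not mail-parser's MailParser class)."""

    def __init__(self, raw_email):
        self._message = message_from_string(raw_email)

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name.endswith("_raw"):
            # Every value of the header, serialized as a JSON list.
            values = self._message.get_all(name[: -len("_raw")], [])
            return json.dumps(values)
        if name.endswith("_json"):
            # The single header value, serialized as a JSON string.
            value = self._message.get(name[: -len("_json")], "")
            return json.dumps(value)
        # Plain access: the header value as a Python string.
        return self._message.get(name, "")


view = HeaderView("Subject: Hello\nFrom: a@example.com\n\nbody\n")
print(view.subject)       # Hello
print(view.subject_raw)   # ["Hello"]
print(view.subject_json)  # "Hello"
```

Routing all three forms through one `__getattr__` means a new header needs no new property, at the cost of typos in attribute names returning an empty value instead of raising.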
+
+### Property Naming Convention
+
+Headers with hyphens use **underscore substitution** (`core.py:__getattr__`):
+
+```python
+mail.X_MSMail_Priority # Accesses "X-MSMail-Priority" header
+mail.Content_Type # Accesses "Content-Type" header
+```
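
A minimal sketch of the underscore-to-hyphen lookup, assuming nothing about how `core.py` actually implements it. `HeaderAccessor` is a hypothetical name, and the lookup relies on the standard library's case-insensitive header access.

```python
from email import message_from_string


class HeaderAccessor:
    """Hypothetical helper illustrating underscore-to-hyphen lookup."""

    def __init__(self, raw_email):
        self._message = message_from_string(raw_email)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # "X_MSMail_Priority" -> "X-MSMail-Priority"
        header_name = name.replace("_", "-")
        # email.message.Message matches header names case-insensitively.
        return self._message.get(header_name, "")


raw = "X-MSMail-Priority: Normal\nContent-Type: text/plain\n\nbody\n"
headers = HeaderAccessor(raw)
print(headers.X_MSMail_Priority)  # Normal
print(headers.Content_Type)       # text/plain
```

An implementation that also supports suffixes such as `_raw` would have to strip the suffix before this substitution, since `subject_raw` would otherwise be looked up as `subject-raw`.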
+
+## Development Workflows
+
+### Dependency Management with uv
+
+The project uses **[uv](https://github.com/astral-sh/uv)** (modern pip/virtualenv replacement) exclusively:
+
+```bash
+uv sync # Install all dev/test dependencies (defined in pyproject.toml)
+make install # Alias for uv sync
+```
+
+Never use `pip` directly—all commands in Makefile use `uv run` prefix.
+
+### Testing Patterns
+
+```bash
+make test # pytest with coverage (generates coverage.xml, junit.xml, htmlcov/)
+make lint # ruff check .
+make format # ruff format .
+make check # lint + test
+make pre-commit # Run all pre-commit hooks
+```
+
+When adding features or fixing bugs you MUST follow these steps:
+
+1. Add relevant test email to `tests/mails/` if demonstrating new case
+2. Write tests in the corresponding test file following existing patterns, under `tests/`
+3. Run `make test` to verify all tests pass before committing
+4. Run `uv run mail-parser -f tests/mails/mail_test_11 -j` to manually verify JSON output and that new changes
+work as expected
+5. Run `make pre-commit` to ensure code style compliance before pushing
+
+**Test data location**: `tests/mails/` contains malformed emails, Outlook files, and various encodings
+(`mail_test_1` through `mail_test_17`, `mail_malformed_1-3`, `mail_outlook_1`).
+
+**Critical testing rule**: When modifying parsing logic, test against malformed emails to ensure security defect
+detection still works.
+
+### Build & Release Process
+
+```bash
+make build # uv build → creates dist/*.tar.gz and dist/*.whl
+make release # build + twine upload to PyPI
+```
+
+Version is **dynamically loaded** from `src/mailparser/version.py` (see
+`pyproject.toml:tool.hatch.version`).
+
+## Security-First Parsing
+
+### Defect Detection System
+
+The parser identifies RFC violations that could indicate malicious intent (`core.py:240-268`):
+
+```python
+mail.has_defects # Boolean flag
+mail.defects # List of defect dicts by content type
+mail.defects_categories # Set of defect class names (e.g., "StartBoundaryNotFoundDefect")
+```
+
+**Epilogue defect handling** (`core.py:320-335`): When `EPILOGUE_DEFECTS` are detected, parser extracts hidden
+content between MIME boundaries that could contain malicious payloads.
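
Defect class names such as `StartBoundaryNotFoundDefect` come from Python's standard `email` package, so the idea behind the defect properties can be sketched without mail-parser. The helper `collect_defects` and the sample message are assumptions made for illustration, and the sketch does not reproduce the package's epilogue handling.

```python
from email import message_from_string, policy

# A multipart message whose declared boundary never appears in the body.
RAW = (
    "From: a@example.com\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "No boundary line ever appears here.\n"
)


def collect_defects(message):
    """Return the set of defect class names found on any MIME part."""
    categories = set()
    for part in message.walk():
        for defect in part.defects:
            categories.add(type(defect).__name__)
    return categories


message = message_from_string(RAW, policy=policy.default)
defect_categories = collect_defects(message)
has_defects = bool(defect_categories)
print(has_defects, sorted(defect_categories))
```

Collecting class names instead of defect instances gives a set that is easy to compare against a list of categories considered suspicious.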
+
+### IP Address Extraction
+
+`get_server_ipaddress(trust)` method (`core.py:487-528`) extracts sender IPs with **trust-level validation**:
+
+```python
+# Finds first non-private IP in trusted headers
+mail.get_server_ipaddress(trust="Received")
+```
+
+Filters out private IP ranges using Python's `ipaddress` module.
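
The private-range filter can be sketched as follows. This is not the body of `get_server_ipaddress`: the function name `first_public_ip`, the IPv4-only regex, and the sample headers are assumptions made for illustration.

```python
import ipaddress
import re

# Loose IPv4 candidate pattern; validation is left to the ipaddress module.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def first_public_ip(header_values):
    """Return the first non-private IPv4 address found, or None."""
    for value in header_values:
        for candidate in IPV4_CANDIDATE.findall(value):
            try:
                address = ipaddress.ip_address(candidate)
            except ValueError:
                continue  # e.g. "999.1.1.1" matches the regex but is not an IP
            if not address.is_private:
                return str(address)
    return None


received = [
    "from relay.internal (10.0.0.5) by mx.internal",
    "from mail.example.org (8.8.8.8) by relay.internal",
]
print(first_public_ip(received))  # 8.8.8.8
```

Leaving validation to `ipaddress.ip_address` keeps the regex simple and lets `is_private` cover the reserved ranges without a hand-written list.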
+
+### Received Header Parsing
+
+Complex regex-based parsing (`utils.py:302-360`, patterns in `const.py:24-73`) extracts hop-by-hop routing:
+
+```python
+# Returns list of dicts with: by, from, date, date_utc, delay, envelope_from, hop, with
+mail.received
+```
+
+**Key pattern**: `RECEIVED_COMPILED_LIST` contains pre-compiled regexes for "from", "by", "with", "id", "for",
+"via", "envelope-from", "envelope-sender", and date patterns. Recent fixes addressed IBM gateway duplicate matches
+(see comments in `const.py:26-38`).
+
+If parsing fails, falls back to `receiveds_not_parsed()` returning `{"raw": <header>, "hop": <n>}`
+structure.
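
A simplified sketch of clause-by-clause parsing with a raw fallback. The three patterns below are hypothetical stand-ins, far looser than whatever `const.py` contains, and numbering hops from 1 in list order is an assumption.

```python
import re

# Hypothetical, simplified patterns; not the contents of RECEIVED_COMPILED_LIST.
CLAUSE_PATTERNS = {
    "from": re.compile(r"\bfrom\s+([^\s;]+)", re.IGNORECASE),
    "by": re.compile(r"\bby\s+([^\s;]+)", re.IGNORECASE),
    "with": re.compile(r"\bwith\s+([^\s;]+)", re.IGNORECASE),
}


def parse_received(headers):
    """Parse each Received header into a dict, keeping unparsable ones raw."""
    hops = []
    for hop, header in enumerate(headers, start=1):
        fields = {}
        for key, pattern in CLAUSE_PATTERNS.items():
            match = pattern.search(header)
            if match:
                fields[key] = match.group(1)
        if fields:
            fields["hop"] = hop
            hops.append(fields)
        else:
            # Nothing matched: keep the header verbatim instead of dropping it.
            hops.append({"raw": header, "hop": hop})
    return hops


sample = [
    "from a.example.org by b.example.org with ESMTP; Mon, 1 Jan 2024 00:00:00 +0000",
    "garbled header text",
]
parsed = parse_received(sample)
print(parsed)
```

Keeping the unparsed header under `raw` means a malformed hop still shows up in the output, which matters when the malformed hop is the interesting one.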
+
+## Project Structure Specifics
+
+### src/ Layout
+
+Package uses modern **src-layout** (`src/mailparser/`) for cleaner imports and testing isolation:
+
+```text
+src/mailparser/
+├── __init__.py # Exports factory functions
+├── __main__.py # CLI entry point (mail-parser command)
+├── core.py # MailParser class (760 lines)
+├── utils.py # Parsing utilities (582 lines)
+├── const.py # Regex patterns and constants
+├── exceptions.py # Exception hierarchy
+└── version.py # Version string
+```
+
+### External Dependency: Outlook Support
+
+Outlook `.msg` file parsing requires **system-level Perl module**:
+
+```bash
+apt-get install libemail-outlook-message-perl # Debian/Ubuntu
+```
+
+Triggered via `msgconvert()` function in `utils.py` that shells out to Perl script. Raises `MailParserOutlookError`
+if unavailable.
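
The shell-out-and-raise behaviour can be sketched as below. `OutlookConversionError` and `convert_msg` are hypothetical names standing in for `MailParserOutlookError` and `msgconvert()`, and the command-line arguments passed to the converter are an assumption.

```python
import shutil
import subprocess


class OutlookConversionError(Exception):
    """Hypothetical stand-in for MailParserOutlookError."""


def convert_msg(path, command="msgconvert"):
    """Run an external converter on an Outlook .msg file."""
    # Fail early with a clear error when the tool is not installed.
    if shutil.which(command) is None:
        raise OutlookConversionError(f"{command!r} not found on PATH")
    try:
        subprocess.run([command, path], check=True, capture_output=True)
    except subprocess.CalledProcessError as error:
        message = error.stderr.decode(errors="replace")
        raise OutlookConversionError(message) from error
```

Checking `shutil.which` first separates "the dependency is missing" from "the conversion failed", which are different problems for the caller to handle.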
+
+### CLI Tool Pattern
+
+`__main__.py` provides production CLI with mutually exclusive input modes (`-f`, `-s`, `-k`), JSON output (`-j`),
+and selective printing (`-b`, `-a`, `-r`, `-t`).
+
+**Entry point defined** in `pyproject.toml:project.scripts`:
+
+```toml
+[project.scripts]
+mail-parser = "mailparser.__main__:main"
+```
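
Mutually exclusive input modes map naturally onto `argparse`. The sketch below uses the flag letters named above, but the `dest` names, the help strings, making the group required, and treating `-k` as a switch are all assumptions, not a copy of `__main__.py`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="mail-parser")
    # Exactly one input mode may be chosen (required=True is an assumption).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", dest="file", help="read the mail from a file")
    source.add_argument("-s", dest="string", help="read the mail from a string")
    source.add_argument("-k", dest="stdin", action="store_true",
                        help="read the mail from stdin")
    parser.add_argument("-j", dest="json", action="store_true",
                        help="print JSON output")
    return parser


args = build_parser().parse_args(["-f", "tests/mails/mail_test_11", "-j"])
print(args.file, args.json)  # tests/mails/mail_test_11 True
```

Passing two input modes at once, for example `-f` together with `-s`, makes `argparse` exit with a usage error before any parsing code runs.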
+
+## Code Style & Tooling
+
+### Ruff Configuration
+
+Single linter/formatter (replaces black, isort, flake8):
+
+```toml
+[tool.ruff.lint]
+select = ["E", "F", "I"] # pycodestyle, pyflakes, isort
+# "UP", "B", "SIM", "S", "PT" commented out in pyproject.toml
+```
+
+### Pytest Configuration
+
+Key markers in `pyproject.toml:tool.pytest.ini_options`:
+
+- `integration`: marks integration tests
+- Coverage outputs: XML (for CI), HTML (for local), terminal
+- JUnit XML for CI integration
+
+## Common Pitfalls
+
+1. **Don't instantiate `MailParser()` directly**—use factory functions from `__init__.py`
+2. **Don't use `pip`**—always use `uv` or Makefile targets
+3. **Don't ignore defects**—they're critical for security analysis
+4. **Don't assume headers exist**—use `.get()` pattern or handle `None`
+5. **Test against malformed emails**—`tests/mails/mail_malformed_*` files exist for this reason
+
+## Docker Development
+
+Dockerfile uses **Python 3.10-slim-bookworm** with Outlook dependencies pre-installed. Container runs as non-root
+`mailparser` user.
+
+```bash
+docker build -t mail-parser .
+docker run mail-parser -f /path/to/email
+```
+
+## Key Reference Points
+
+- **Property implementation**: `core.py:540-730` (all `@property` decorators)
+- **Attachment extraction**: `core.py:355-475` (walks multipart, handles encoding)
+- **Received parsing logic**: `utils.py:302-455` + `const.py:24-73` (regex patterns)
+- **CLI implementation**: `__main__.py:30-347` (argparse + output formatting)
+- **Exception hierarchy**: `exceptions.py:20-60` (5 exception types)
+
+## Testing Strategy
+
+When adding features:
+
+1. Add test email to `tests/mails/` if demonstrating new case
+2. Write tests in `tests/test_mail_parser.py` following existing patterns
+3. Test both normal and `_raw`/`_json` property variants
+4. Verify defect detection for security-relevant changes
+5. Run `make check` before committing