emailcanon 0.1.0__tar.gz

@@ -0,0 +1,29 @@
1
+ name: Lint
2
+
3
+ on:
4
+   push:
5
+     branches: [master]
6
+   pull_request:
7
+
8
+ jobs:
9
+   lint:
10
+     runs-on: ubuntu-latest
11
+     steps:
12
+       - uses: actions/checkout@v4
13
+
14
+       - name: Set up Python
15
+         uses: actions/setup-python@v5
16
+         with:
17
+           python-version: "3.12"
18
+
19
+       - name: Install dev dependencies
20
+         run: pip install -e ".[dev]"
21
+
22
+       - name: Ruff (lint)
23
+         run: ruff check .
24
+
25
+       - name: Ruff (format check)
26
+         run: ruff format --check .
27
+
28
+       - name: Mypy (type check)
29
+         run: mypy
@@ -0,0 +1,27 @@
1
+ name: Tests
2
+
3
+ on:
4
+   push:
5
+     branches: [master]
6
+   pull_request:
7
+
8
+ jobs:
9
+   test:
10
+     runs-on: ubuntu-latest
11
+     strategy:
12
+       fail-fast: false
13
+       matrix:
14
+         python-version: ["3.12", "3.13"]
15
+     steps:
16
+       - uses: actions/checkout@v4
17
+
18
+       - name: Set up Python ${{ matrix.python-version }}
19
+         uses: actions/setup-python@v5
20
+         with:
21
+           python-version: ${{ matrix.python-version }}
22
+
23
+       - name: Install package
24
+         run: pip install -e ".[dev]"
25
+
26
+       - name: Run tests
27
+         run: python -m unittest discover -s tests -p "*.py" -v
@@ -0,0 +1,31 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.pyo
5
+ *.pyd
6
+ *.so
7
+ *.egg
8
+ *.egg-info/
9
+ dist/
10
+ build/
11
+ .eggs/
12
+
13
+ # Virtual environments
14
+ .venv/
15
+ venv/
16
+ env/
17
+
18
+ # Type checking
19
+ .mypy_cache/
20
+
21
+ # Testing
22
+ .pytest_cache/
23
+ .coverage
24
+ htmlcov/
25
+
26
+ # Ruff
27
+ .ruff_cache/
28
+
29
+ # IDE
30
+ .vscode/
31
+ .idea/
@@ -0,0 +1,5 @@
1
+ Copyright 2026 grMLEqomlkkU5Eeinz4brIrOVCUCkJuN
2
+
3
+ Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
4
+
5
+ THE SOFTWARE IS PROVIDED “AS IS” AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
@@ -0,0 +1,367 @@
1
+ Metadata-Version: 2.4
2
+ Name: emailcanon
3
+ Version: 0.1.0
4
+ Summary: A Python library for email canonicalization
5
+ License: MIT
6
+ License-File: LICENSE
7
+ Requires-Python: >=3.12
8
+ Provides-Extra: dev
9
+ Requires-Dist: mypy>=1.10; extra == 'dev'
10
+ Requires-Dist: ruff>=0.4; extra == 'dev'
11
+ Description-Content-Type: text/markdown
12
+
13
+ # emailCanon
14
+
15
+ A Python library for email address canonicalization and normalization. Normalizes email addresses according to provider-specific rules (Gmail, Outlook, Yahoo, etc.) to help identify duplicate accounts and standardize email formats.
16
+
17
+ This is the Python equivalent of [@grml/nomadic](https://github.com/grMLEqomlkkU5Eeinz4brIrOVCUCkJuN/nomad).
18
+
19
+ ## Features
20
+
21
+ - **Provider-aware normalization**: Applies provider-specific rules (sub-address stripping, dot removal, case-folding)
22
+ - **Canonical domain collapsing**: Maps alias domains to canonical domains (e.g., `googlemail.com` becomes `gmail.com`)
23
+ - **Sub-address stripping**: Removes subaddresses (e.g., `user+tag` becomes `user`) per provider
24
+ - **RFC-compliant**: Handles quoted local parts and validates email structure
25
+ - **Customizable**: Extend with custom providers or override default rules
26
+ - **Built-in providers**: Gmail, Microsoft (Outlook/Hotmail), Yahoo, Apple iCloud, Fastmail, ProtonMail, and 9 others
27
+
28
+ ## Installation
29
+
30
+ ```bash
31
+ pip install emailcanon
32
+ ```
33
+
34
+ ## Quick Start
35
+
36
+ ```python
37
+ from emailcanon import normalizeEmail, getEmailProvider, isSameEmail
38
+
39
+ # Basic normalization
40
+ normalized = normalizeEmail("Test.User+Tag@Gmail.com")
41
+ print(normalized) # "testuser@gmail.com"
42
+
43
+ # Get the provider ID
44
+ provider = getEmailProvider("user@outlook.com")
45
+ print(provider) # "microsoft"
46
+
47
+ # Check if two emails are equivalent
48
+ same = isSameEmail("john.doe+newsletter@gmail.com", "johndoe@googlemail.com")
49
+ print(same) # True
50
+ ```
51
+
52
+ ## API Reference
53
+
54
+ ### `normalizeEmail(email: str, options: NormalizeOptions | None = None) -> str`
55
+
56
+ Normalizes an email address to its canonical form.
57
+
58
+ Applies provider-specific rules including:
59
+ - Sub-address stripping (e.g., `+tag` for Gmail)
60
+ - Dot removal (Gmail ignores dots in the local part)
61
+ - Case folding of the local part
62
+ - Canonical domain mapping (alias domains to primary domain)
63
+
64
+ **Parameters:**
65
+ - `email`: The email address to normalize
66
+ - `options`: Optional normalization options (see `NormalizeOptions` below)
67
+
68
+ **Returns:** Canonical email string. The string is returned regardless of
69
+ whether the address is structurally valid; this function discards the `valid`
70
+ flag. If you need the validity flag or the parsed components, use
71
+ [`normalizeEmailDetailed`](#normalizeemaildetailedemail-str-options-normalizeoptions--none--none---normalizedemail).
72
+
73
+ **Raises:** `TypeError` if email is not a string
74
+
75
+ **Examples:**
76
+ ```python
77
+ normalizeEmail("john.doe+newsletter@GMAIL.COM") # "johndoe@gmail.com"
78
+ normalizeEmail("First.Name@outlook.com") # "first.name@outlook.com" (Microsoft rules keep dots)
79
+ normalizeEmail("user-tag@yahoo.com") # "user@yahoo.com"
80
+ normalizeEmail("\"quoted.local\"@example.com") # "\"quoted.local\"@example.com"
81
+ ```
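The four rules listed above can be sketched in plain Python. This is a simplified illustration for Gmail-style rules only, not emailcanon's actual implementation: `sketch_normalize_gmail` and `GMAIL_DOMAINS` are hypothetical names, and the sketch assumes an unquoted local part and performs no validation.

```python
# Hypothetical, simplified illustration; not emailcanon's actual code.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def sketch_normalize_gmail(email: str) -> str:
    """Apply Gmail-style rules to an unquoted address; barely touch others."""
    if not isinstance(email, str):
        raise TypeError("email must be a string")
    # Split on the last "@" so the domain is whatever follows it.
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" at all: nothing sensible to normalize
    domain = domain.lower()
    if domain not in GMAIL_DOMAINS:
        return f"{local}@{domain}"  # unknown domain: lowercase the domain only
    local = local.lower()            # case folding
    local = local.split("+", 1)[0]   # sub-address stripping
    local = local.replace(".", "")   # dot removal
    return f"{local}@gmail.com"      # canonical domain mapping


print(sketch_normalize_gmail("Test.User+Tag@Gmail.com"))  # testuser@gmail.com
print(sketch_normalize_gmail("User@Example.COM"))         # User@example.com
```

The order of the local-part steps does not matter for these particular rules, but stripping the sub-address before removing dots avoids doing work on text that is discarded anyway.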
82
+
83
+ ### `normalizeEmailDetailed(email: str, options: NormalizeOptions | None = None) -> NormalizedEmail`
84
+
85
+ Normalizes an email and returns detailed information about the normalization.
86
+
87
+ **Returns:** `NormalizedEmail` object with:
88
+ - `normalized`: Canonical email string
89
+ - `local`: Normalized local part
90
+ - `domain`: Normalized domain
91
+ - `providerId`: ID of matched provider (e.g., "gmail"), or `None`
92
+ - `subaddress`: Extracted sub-address (e.g., "tag" from "user+tag"), or `None`
93
+ (an empty tag such as `user+` yields `None`, just like having no separator)
94
+ - `valid`: Whether the email is structurally valid (see
95
+ [Validity flag limitations](#validity-flag-limitations))
96
+
97
+ **Example:**
98
+ ```python
99
+ result = normalizeEmailDetailed("john+newsletter@gmail.com")
100
+ # NormalizedEmail(
101
+ #     normalized="john@gmail.com",
101
+ #     local="john",
102
+ #     domain="gmail.com",
103
+ #     providerId="gmail",
104
+ #     subaddress="newsletter",
105
+ #     valid=True
107
+ # )
108
+ ```
109
+
110
+ ### `getEmailProvider(email: str, options: NormalizeOptions | None = None) -> str | None`
111
+
112
+ Returns the provider ID for an email address, or `None` if no provider matches.
113
+
114
+ **Parameters:**
115
+ - `email`: The email address
116
+ - `options`: Optional normalization options
117
+
118
+ **Returns:** Provider ID string (e.g., "gmail", "microsoft") or `None`
119
+
120
+ **Raises:** `TypeError` if email is not a string
121
+
122
+ **Example:**
123
+ ```python
124
+ getEmailProvider("user@gmail.com") # "gmail"
125
+ getEmailProvider("name@outlook.com") # "microsoft"
126
+ getEmailProvider("person@example.com") # None
127
+ ```
128
+
129
+ ### `isSameEmail(a: str, b: str, options: NormalizeOptions | None = None) -> bool`
130
+
131
+ Checks if two email addresses normalize to the same canonical form.
132
+
133
+ Useful for detecting duplicate accounts where users registered with different aliases.
134
+
135
+ **Parameters:**
136
+ - `a`, `b`: Email addresses to compare
137
+ - `options`: Optional normalization options
138
+
139
+ **Returns:** `True` if both emails normalize identically, `False` otherwise
140
+
141
+ **Example:**
142
+ ```python
143
+ isSameEmail("john.doe@gmail.com", "johndoe+spam@googlemail.com") # True
144
+ isSameEmail("john@example.com", "jane@example.com") # False
145
+ ```
146
+
147
+ ## Configuration
148
+
149
+ ### `NormalizeOptions`
150
+
151
+ Control normalization behavior via options:
152
+
153
+ ```python
154
+ from emailcanon import NormalizeOptions, ProviderRule, normalizeEmail
155
+
156
+ options = NormalizeOptions(
157
+     lowercaseDomain=True,  # Default: True
158
+     providers=[  # Custom providers to add/override
159
+         ProviderRule(
160
+             id="custom",
161
+             domains=["custom.example.com"],
162
+             lowercaseLocal=True,
163
+             removeDots=True,
164
+             subaddressSeparators=["+"]
165
+         )
166
+     ],
167
+     replaceDefaultProviders=False,  # Keep built-in providers
168
+     defaultRule=None  # Rule for unknown domains
169
+ )
170
+
171
+ normalized = normalizeEmail("user@custom.example.com", options)
172
+ ```
173
+
174
+ **Caching note:** The provider registry built from a `NormalizeOptions`
175
+ instance is memoized per options object (keyed on identity), so reusing the
176
+ same `options` across many calls — the typical bulk-deduplication pattern —
177
+ avoids rebuilding the registry every time. Because the cache is keyed on object
178
+ identity, mutating an `options` object after its first use will not be
179
+ reflected; construct a fresh `NormalizeOptions` instead of mutating one.
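The identity-keyed caching described above can be illustrated with a small standalone sketch. It is an assumption-laden illustration of the general technique, not emailcanon's code: `SketchOptions`, `_registry_cache`, and `get_registry` are hypothetical names, and the "registry" here is just a frozen set of lowercased domains.

```python
# Hypothetical sketch of identity-keyed memoization; not emailcanon's code.
class SketchOptions:
    def __init__(self, domains: list[str]) -> None:
        self.domains = domains


# Maps id(options) -> (options, registry). Storing the options object keeps it
# alive, so its id cannot be reused by a different object.
_registry_cache: dict[int, tuple[SketchOptions, frozenset[str]]] = {}


def get_registry(options: SketchOptions) -> frozenset[str]:
    """Build the registry once per options object and reuse it afterwards."""
    cached = _registry_cache.get(id(options))
    if cached is None:
        cached = (options, frozenset(d.lower() for d in options.domains))
        _registry_cache[id(options)] = cached
    return cached[1]


opts = SketchOptions(["Example.com"])
first = get_registry(opts)
opts.domains.append("other.example")  # mutation after first use
second = get_registry(opts)
print(second is first)                # True: the cached registry is reused
print("other.example" in second)      # False: the mutation was not picked up

fresh = SketchOptions(["Example.com", "other.example"])
print("other.example" in get_registry(fresh))  # True
```

The last three lines show the recommended pattern: a new options object gets its own cache entry, so its contents are honored.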
180
+
181
+ ### `ProviderRule`
182
+
183
+ Define custom email provider rules:
184
+
185
+ ```python
186
+ ProviderRule(
187
+     id="my_provider",  # Unique identifier
188
+     domains=["example.com", "mail.example.com"],  # Domain patterns
189
+     lowercaseLocal=True,  # Convert local part to lowercase
190
+     removeDots=True,  # Remove dots from local part
191
+     subaddressSeparators=["+", "-"],  # Characters that separate subaddress
192
+     canonicalDomain="example.com"  # Map all domains to this
193
+ )
194
+ ```
195
+
196
+ ## Supported Providers
197
+
198
+ | Provider | ID | Domains | Rules |
199
+ |----------|----|---------| ----- |
200
+ | **Gmail** | `gmail` | gmail.com, googlemail.com | Lowercase, remove dots, `+` subaddress, maps to gmail.com |
201
+ | **Microsoft** | `microsoft` | outlook.com*, hotmail.com*, live.com*, msn.com, others | Lowercase, `+` subaddress |
202
+ | **Yahoo** | `yahoo` | yahoo.com*, ymail.com, rocketmail.com | Lowercase, `-` subaddress |
203
+ | **Apple iCloud** | `icloud` | icloud.com, me.com, mac.com | Lowercase, `+` subaddress |
204
+ | **Fastmail** | `fastmail` | fastmail.com, fastmail.fm | Lowercase, `+` subaddress |
205
+ | **ProtonMail** | `proton` | protonmail.com, protonmail.ch, proton.me, pm.me | Lowercase, `+` subaddress |
206
+ | **Yandex** | `yandex` | yandex.com, yandex.ru, ya.ru, others | Lowercase, `+` subaddress |
207
+ | **Zoho** | `zoho` | zoho.com, zohomail.com, zoho.eu | Lowercase, `+` subaddress |
208
+ | **Tutanota** | `tutanota` | tutanota.com, tutanota.de, tutamail.com, tuta.com, others | Lowercase, `+` subaddress |
209
+ | **Posteo** | `posteo` | posteo.de, posteo.net | Lowercase, `+` subaddress |
210
+ | **Mailbox.org** | `mailbox` | mailbox.org | Lowercase, `+` subaddress |
211
+ | **Mailfence** | `mailfence` | mailfence.com | Lowercase, `+` subaddress |
212
+ | **Runbox** | `runbox` | runbox.com | Lowercase, `+` subaddress |
213
+ | **Pobox** | `pobox` | pobox.com | Lowercase, `+` subaddress |
214
+ | **AOL** | `aol` | aol.com, aim.com | Lowercase |
215
+
216
+ *\* Multiple regional variants supported (com.au, co.uk, de, fr, etc.)*
217
+
218
+ ## Examples
219
+
220
+ ### Duplicate Account Detection
221
+
222
+ ```python
223
+ from emailcanon import normalizeEmail
224
+
225
+ emails = [
226
+     "john.doe@gmail.com",
227
+     "johndoe+shopping@googlemail.com",
228
+     "john.doe@yahoo.com",
229
+     "J.DOE@GMAIL.COM"
230
+ ]
231
+
232
+ # Group by normalized form
233
+ normalized_map = {}
234
+ for email in emails:
235
+     norm = normalizeEmail(email)
236
+     if norm not in normalized_map:
237
+         normalized_map[norm] = []
238
+     normalized_map[norm].append(email)
239
+
240
+ # Find duplicates ("J.DOE@GMAIL.COM" normalizes to jdoe@gmail.com, so it is not grouped)
241
+ for norm_email, originals in normalized_map.items():
242
+     if len(originals) > 1:
243
+         print(f"Duplicates: {originals} map to {norm_email}")
244
+ # Output: Duplicates: ['john.doe@gmail.com', 'johndoe+shopping@googlemail.com'] map to johndoe@gmail.com
245
+ ```
246
+
247
+ ### Custom Provider Rules
248
+
249
+ ```python
250
+ from emailcanon import normalizeEmail, NormalizeOptions, ProviderRule
251
+
252
+ # Add custom provider
253
+ options = NormalizeOptions(
254
+     providers=[
255
+         ProviderRule(
256
+             id="company",
257
+             domains=["company.com", "corp.company.com"],
258
+             canonicalDomain="company.com",
259
+             lowercaseLocal=True,
260
+             subaddressSeparators=["+"]
261
+         )
262
+     ]
263
+ )
264
+
265
+ normalizeEmail("User+Team@corp.company.com", options)
266
+ # "user@company.com"
267
+ ```
268
+
269
+ ### Skip Default Providers
270
+
271
+ ```python
272
+ from emailcanon import normalizeEmail, NormalizeOptions, ProviderRule
273
+
274
+ # Use only custom providers, ignore built-in ones
275
+ options = NormalizeOptions(
276
+     replaceDefaultProviders=True,
277
+     providers=[
278
+         ProviderRule(
279
+             id="custom",
280
+             domains=["custom.local"],
281
+             lowercaseLocal=True
282
+         )
283
+     ]
284
+ )
285
+
286
+ normalizeEmail("User@custom.local", options)
287
+ # "user@custom.local"
288
+ ```
289
+
290
+ ## Design Notes
291
+
292
+ - **Conservative by default**: Unknown domains get minimal normalization (lowercase domain only, local part unchanged)
293
+ - **Quoted strings**: RFC 5321 quoted local parts (e.g., `"user name"@example.com`) are preserved verbatim
294
+ - **Domain validation**: Domains must follow standard DNS naming (labels separated by dots, alphanumeric + hyphens)
295
+ - **Immutable rules**: Provider rules are frozen dataclasses; mutation is not possible
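As a rough illustration of the frozen-dataclass point above, the following sketch uses a hypothetical stand-in class (`SketchRule`, not emailcanon's `ProviderRule`) to show what happens on attribute assignment and how a modified copy can be derived with the standard library's `dataclasses.replace`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SketchRule:
    """Hypothetical stand-in for a provider rule; fields are illustrative."""

    id: str
    lowercaseLocal: bool = False


rule = SketchRule(id="custom")
try:
    rule.lowercaseLocal = True  # type: ignore[misc]
except FrozenInstanceError:
    print("cannot assign to a field of a frozen rule")

# Derive a modified copy instead of mutating in place.
updated = replace(rule, lowercaseLocal=True)
print(rule.lowercaseLocal, updated.lowercaseLocal)  # False True
```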
296
+
297
+ ### Validity flag limitations
298
+
299
+ The `valid` flag returned by `normalizeEmailDetailed` is a pragmatic structural
300
+ check, not a full RFC 5321/5322 validator. In particular, the domain check has
301
+ two known limitations:
302
+
303
+ - **Single-label hosts are rejected.** The domain must contain at least one dot,
304
+ so hosts like `localhost` are reported as `valid=False`, even though they are
305
+ deliverable in some environments.
306
+ - **Non-ASCII / IDN domains are rejected.** The check is ASCII-only, so
307
+ internationalized domains such as `münchen.de` are reported as
308
+ `valid=False`. Pre-encode them to Punycode (`xn--mnchen-3ya.de`) if you need
309
+ them to pass.
310
+
311
+ These limitations only affect the `valid` flag; normalization of the local part
312
+ and domain is still performed in all cases.
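The two limitations can be reproduced with a minimal standalone check. This is a sketch under assumptions, not the library's validator: `sketch_domain_is_valid` and `to_punycode` are hypothetical helpers, the label pattern is a guess at "alphanumeric + hyphens", and the encoding step relies on Python's built-in `idna` codec (which implements the older IDNA 2003 rules).

```python
import re

# One DNS label: ASCII letters, digits, and inner hyphens (assumed rule).
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def sketch_domain_is_valid(domain: str) -> bool:
    """Return True only for ASCII domains with at least two labels."""
    labels = domain.split(".")
    return len(labels) >= 2 and all(_LABEL.fullmatch(label) for label in labels)


def to_punycode(domain: str) -> str:
    """Encode an internationalized domain with the stdlib IDNA codec."""
    return domain.encode("idna").decode("ascii")


print(sketch_domain_is_valid("localhost"))   # False: single label
print(sketch_domain_is_valid("münchen.de"))  # False: non-ASCII label
print(to_punycode("münchen.de"))             # xn--mnchen-3ya.de
print(sketch_domain_is_valid(to_punycode("münchen.de")))  # True
```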
313
+
314
+ ## Why Email Canonicalization?
315
+
316
+ Email addresses can look different but deliver to the same mailbox:
317
+
318
+ | Input | Gmail Reality |
319
+ |-------|---------------|
320
+ | `john.doe@gmail.com` | `johndoe@gmail.com` (dots ignored) |
321
+ | `johndoe+newsletter@gmail.com` | `johndoe@gmail.com` (subaddress stripped) |
322
+ | `johndoe@googlemail.com` | `johndoe@gmail.com` (domain alias) |
323
+
324
+ Without canonicalization, a single user could register multiple accounts using aliases of the same mailbox. emailCanon reduces such aliases to one canonical address so that duplicate registrations can be detected and prevented.
325
+
326
+ ## Development
327
+
328
+ Set up a local virtual environment and install the package with its dev
329
+ dependencies (mypy, ruff):
330
+
331
+ ```bash
332
+ # Create and activate a virtual environment
333
+ python -m venv .venv
334
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
335
+
336
+ # Install the package in editable mode with dev extras
337
+ pip install -e ".[dev]"
338
+ ```
339
+
340
+ Then run the tooling:
341
+
342
+ ```bash
343
+ # Run the test suite (the test files are named after the area they cover,
344
+ # e.g. gmail.py, so discovery needs an explicit pattern)
345
+ python -m unittest discover -s tests -p "*.py"
346
+
347
+ # ...or run a single test module directly
348
+ python -m unittest tests.gmail
349
+
350
+ mypy # type-check
351
+ ruff format # format with tabs
352
+ ruff check # lint
353
+ ```
354
+
355
+ When you're done, leave the environment with `deactivate`.
356
+
357
+ ## References
358
+
359
+ Provider rules compiled from:
360
+ - Wikipedia, [Email address](https://en.wikipedia.org/wiki/Email_address) (sub-addressing section)
361
+ - [aaronbassett's Email sub-addressing gist](https://gist.github.com/aaronbassett/2f8b3a26cf54e5e1fc9c)
362
+ - [validator.js](https://github.com/validatorjs/validator.js) normalizeEmail conventions
363
+ - Official provider documentation (Fastmail, Microsoft Learn, Proton)
364
+
365
+ ## License
366
+
367
+ MIT