bytesense-0.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
---
**LICENSE**
MIT License

Copyright (c) 2024 Oğuzhan Kır

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
---
**PKG-INFO** (package metadata)
Metadata-Version: 2.4
Name: bytesense
Version: 0.1.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=5.0 ; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0 ; extra == 'dev'
Requires-Dist: mypy>=1.8 ; extra == 'dev'
Requires-Dist: ruff>=0.4 ; extra == 'dev'
Requires-Dist: maturin>=1.4 ; extra == 'dev'
Requires-Dist: charset-normalizer>=3.3 ; extra == 'dev'
Requires-Dist: chardet>=5.0 ; extra == 'dev'
Requires-Dist: requests>=2.31 ; extra == 'dev'
Requires-Dist: certifi>=2023.0 ; extra == 'dev'
Requires-Dist: twine>=5.0 ; extra == 'dev'
Provides-Extra: dev
Provides-Extra: fast
License-File: LICENSE
Summary: Fast, accurate charset/encoding detection. Zero ML. Zero dependencies. Optional Rust acceleration.
Keywords: encoding,charset,detection,unicode,bytes,chardet,charset-normalizer
Author-email: Oğuzhan Kır <oguzhankir17@gmail.com>
Maintainer-email: Oğuzhan Kır <oguzhankir17@gmail.com>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://github.com/oguzhankir/bytesense/blob/main/CHANGELOG.md
Project-URL: Homepage, https://github.com/oguzhankir/bytesense
Project-URL: Issues, https://github.com/oguzhankir/bytesense/issues
Project-URL: Repository, https://github.com/oguzhankir/bytesense

<p align="center">
  <img src="assets/bytesense_logo.svg" alt="bytesense logo" width="520" />
</p>

<p align="center">
  <strong>Charset detection that stays fast, honest, and dependency-free.</strong><br>
</p>

<p align="center">
  <a href="https://pypi.org/project/bytesense/">
    <img src="https://img.shields.io/pypi/v/bytesense.svg" alt="PyPI version" />
  </a>
  <a href="https://pypi.org/project/bytesense/">
    <img src="https://img.shields.io/pypi/pyversions/bytesense.svg" alt="Python versions" />
  </a>
  <a href="https://github.com/oguzhankir/bytesense/actions/workflows/ci.yml">
    <img src="https://github.com/oguzhankir/bytesense/actions/workflows/ci.yml/badge.svg" alt="CI" />
  </a>
  <a href="https://codecov.io/gh/oguzhankir/bytesense">
    <img src="https://codecov.io/gh/oguzhankir/bytesense/graph/badge.svg" alt="codecov" />
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT" />
  </a>
</p>

<p align="center">
  <sub>Author: <strong>Oğuzhan Kır</strong> · <a href="https://github.com/oguzhankir/bytesense">GitHub</a> · <a href="docs/">Docs</a></sub>
</p>

---

**bytesense** reads raw bytes and tells you which encoding likely produced them—without shipping neural nets, without pulling in other Python packages at install time, and with an explainable `why` on every result. It is designed as a modern alternative to **chardet** and **charset-normalizer** for teams that want predictable performance and a small install footprint.

If you already use `chardet.detect()` or `charset_normalizer.detect()`, you can swap in `bytesense.detect()` with minimal code churn.

## Comparison with chardet & charset-normalizer

Same idea as the [charset-normalizer README](https://github.com/Ousret/charset_normalizer): a **feature matrix** and **measured** accuracy on the bundled suites, so you can see where bytesense stands—not hand-wavy marketing.

### Feature matrix

| Feature | [Chardet](https://github.com/chardet/chardet) | [Charset Normalizer](https://github.com/Ousret/charset_normalizer) | **bytesense** |
|---------|:---:|:---:|:---:|
| Fast | ✅ | ✅ | ✅ |
| Universal¹ | ❌ | ✅ | ✅ |
| Reliable without distinguishable standards | ✅ | ✅ | ✅ |
| Reliable with distinguishable standards | ✅ | ✅ | ✅ |
| License | MIT | MIT | MIT |
| Native Python (no C extension required) | ✅ | ✅ | ✅ |
| Optional native acceleration | — | — | Rust (`pip install "bytesense[fast]"`) |
| Detect spoken language | ✅ | ✅ | ✅ |
| Explainable detection (`why`) | Partial | Rich metadata | **Yes** (always) |
| Streaming-first API | Limited | Via API patterns | **`StreamDetector`** |
| Wheel size (typical) | ~500 kB | ~150 kB | Small (pure Python + tables) |
| Custom codecs via stdlib registration | ❌ | ✅ | ✅ (same `codecs` model) |

¹ *Universal* in the charset-normalizer sense: coverage follows **what your Python build exposes through `codecs`** as decode candidates—not a fixed “99 encodings” count on every platform.
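
To see what that means on your own build, you can count the distinct codecs the interpreter actually resolves, using only the stdlib (illustrative only, unrelated to bytesense's internals):

```python
import codecs
from encodings.aliases import aliases

# Resolve every alias target to its canonical codec name. The total varies
# by Python version and platform, which is why a fixed encoding count is a
# misleading coverage claim.
canonical = set()
for target in set(aliases.values()):
    try:
        canonical.add(codecs.lookup(target).name)
    except LookupError:  # e.g. "mbcs" exists only on Windows builds
        pass

print(len(canonical))  # varies by build/platform
```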

### Accuracy (bundled tests, three-way)

Numbers below are from the same **`./scripts/compare_libraries.sh`** run that the CI exercises: **chardet**, **charset-normalizer**, and **bytesense** on identical inputs. **Strict** = detected codec name matches the reference (after `codecs.lookup` alias resolution). **Functional** = decoded Unicode matches the reference encoding (multiple labels can count if they decode to the same text—same spirit as charset-normalizer’s comparisons).
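
The two criteria can be pinned down in a few lines of stdlib code (a sketch of the scoring rules only, not the actual comparison script):

```python
import codecs

def strict_match(detected: str, reference: str) -> bool:
    """Same canonical codec after alias resolution, e.g. latin-1 == iso8859-1."""
    try:
        return codecs.lookup(detected).name == codecs.lookup(reference).name
    except LookupError:
        return False

def functional_match(payload: bytes, detected: str, reference: str) -> bool:
    """Different labels still pass if both decode the payload to the same text."""
    try:
        return payload.decode(detected) == payload.decode(reference)
    except (LookupError, UnicodeDecodeError):
        return False

print(strict_match("latin-1", "iso8859-1"))          # True: aliases of one codec
print(functional_match(b"plain", "ascii", "utf-8"))  # True: same decoded text
```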

**39-case suite** (synthetic + charset-normalizer’s published `data/` samples, expected labels aligned with their tests):

| Package | Strict | Functional |
|---------|--------|------------|
| **bytesense** | **100%** (39/39) | **100%** (39/39) |
| charset-normalizer | 92.3% (36/39) | 97.4% (38/39) |
| chardet | 79.5% (31/39) | 97.4% (38/39) |

**24-case hard stress** (paragraph-sized, ambiguous SBCS / CJK / UTF-16 / ISO-2022 — `benchmarks/test_hard_scenarios.py`):

| Package | Strict | Functional |
|---------|--------|------------|
| **bytesense** | **83.3%** (20/24) | **100%** (24/24) |
| charset-normalizer | 41.7% (10/24) | 58.3% (14/24) |
| chardet | 70.8% (17/24) | 91.7% (22/24) |

*Updated March 2026 — CPython 3.12, chardet 7.x, charset-normalizer 3.4.x, bytesense 0.1.0. Your CPU and dependency versions will differ; re-run `./scripts/compare_libraries.sh` to refresh.*

### Speed (pytest-benchmark, same-machine snapshot)

Rough single-thread means (`pytest benchmarks/test_bench_detection.py`, speed filter; **your numbers will differ**):

| Sample (`from_bytes` / `detect`) | bytesense | chardet |
|----------------------------------|-----------|---------|
| `utf8_bom` | ~4 µs | ~104 µs |
| `utf8_ascii_only` | ~46 µs | ~114 µs |

UTF-8 with BOM is an early exit in bytesense; chardet still runs its full probe. Profile your own payloads.

### How to reproduce

```bash
pip install -e ".[dev]"
python scripts/build_fingerprints.py
python scripts/fetch_cn_benchmark_samples.py  # mirrors CN `data/` into benchmarks/data/cn_official/
./scripts/compare_libraries.sh                # prints the tables above (accuracy + hard stress)
```

Options: `./scripts/compare_libraries.sh --no-fetch` (no download), `--no-hard` (skip the 24-case stress). Full benches: `./scripts/run_all_benchmarks.sh`.

**Fingerprint script:** lines like `utf_16` “skipped” are normal — UTF-16/32 are handled via BOM / null-byte rules, not the single-byte histogram table.

**Troubleshooting `fetch_cn_benchmark_samples.py` (SSL on macOS):** use `.venv/bin/python scripts/...`; `certifi` is included in `[dev]`. If HTTPS still fails, run *Install Certificates.command* for your Python, or fall back to `python scripts/fetch_cn_benchmark_samples.py --insecure` as a last resort.

## Installation

```bash
pip install bytesense          # pure Python — always works
pip install "bytesense[fast]"  # same API; uses the Rust wheel when one matches your platform
```

**Test PyPI** (pre-release smoke test): after publishing to [test.pypi.org](https://test.pypi.org/), install with an extra index so pip can still resolve dependencies from PyPI:

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ bytesense
```

## CLI

`bytesense` ships with a small CLI for files on disk.

```text
usage: bytesense [-h] [-v] [-m] [--version] FILE [FILE ...]

positional arguments:
  FILE           File(s) to analyse

optional arguments:
  -h, --help     show this help and exit
  -v, --verbose  Include the ``why`` field in JSON (``confidence_interval`` is always shown)
  -m, --minimal  Print only the detected encoding name
  --version      Show version and exit
```

```bash
bytesense ./README.md
bytesense -m ./README.md
python -m bytesense.cli --version
```

`stdout` is JSON (one object per file, or a list when multiple files are given and `-m` is not set):

```json
{
  "encoding": "utf_8",
  "confidence": 0.98,
  "confidence_interval": [0.93, 1.0],
  "language": "English",
  "alternatives": [],
  "bom_detected": false,
  "chaos": 0.02,
  "coherence": 0.41,
  "byte_count": 2048,
  "path": "/absolute/path/to/file"
}
```

Use `-v` to add the human-readable `why` string to each object.

## Python

**Full result object**

```python
from bytesense import from_path

result = from_path("notes.txt")
print(result.encoding, result.confidence, result.language)
print(result.why)
```

**Streaming**

```python
from bytesense import StreamDetector

det = StreamDetector()
# `response` is any chunked byte source, e.g. a `requests` response object.
for chunk in response.iter_content(1024):
    det.feed(chunk)
    if det.confidence >= 0.99:
        break
print(det.encoding, det.language)
```

## Repair mojibake

```python
from bytesense import repair

garbled = "Ã©tÃ©"  # "été" whose UTF-8 bytes were read as Latin-1
result = repair(garbled)
if result.improved:
    print(result.repaired)     # "été"
    print(result.chain)        # ("latin_1", "utf_8")
    print(result.improvement)  # e.g. 0.34
```
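
Chain repairs of this kind come down to a classic round trip: re-encode with the codec that produced the mojibake, then decode with the intended one. A stdlib-only illustration of the trick (not bytesense's internals):

```python
# "été" stored as UTF-8 but mistakenly decoded as Latin-1 displays as "Ã©tÃ©".
garbled = "Ã©tÃ©"

# Undo the damage: recover the original bytes, then apply the correct decode.
fixed = garbled.encode("latin_1").decode("utf_8")
print(fixed)  # été
```

The hard part a detector solves is choosing the chain automatically; the round trip itself is two stdlib calls.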

## Stream from HTTP

```python
import urllib.request

from bytesense import detect_stream

with urllib.request.urlopen("https://example.com") as resp:
    result = detect_stream(
        iter(lambda: resp.read(1024), b""),
        stop_confidence=0.99,
    )
print(result.encoding)
```

## HTML/XML hints

```python
from bytesense import best_hint, from_bytes

# "ë" is 0xEB in cp1252; bytes literals must stay ASCII, so use an escape.
html = b'<meta charset="cp1252"><p>H\xebllo</p>'
hint = best_hint(html, headers={"Content-Type": "text/html"})
result = from_bytes(html)
print(hint, result.encoding)
```

## Multi-encoding documents

```python
from bytesense import detect_multi

# Example: bytes from a legacy .eml or a mixed scrape
your_bytes = b"..."  # replace with your document
result = detect_multi(your_bytes)
print(f"Uniform encoding: {result.is_uniform}")
for seg in result.segments:
    print(f"  [{seg.start}:{seg.end}] {seg.encoding} — {seg.text[:40]!r}")
```

**Drop-in `detect()`**

```python
from bytesense import detect

print(detect(b"hello"))  # dict with "encoding", "confidence", "language" keys
```

More detail: [docs/api.md](docs/api.md) · [docs/quickstart.md](docs/quickstart.md)

## Why bytesense

- **Decode as late as possible.** Histograms, BOMs, and null-byte layout often rule out whole families of encodings before you spend CPU on full decodes.
- **Shortlist, then verify.** Cosine similarity against pre-generated fingerprints (see `scripts/build_fingerprints.py`) keeps the expensive “mess + coherence” phase on a handful of candidates.
- **No black boxes.** No training step, no weights to tune, no network calls—just tables and statistics you can inspect.
- **Rust is optional.** `pip install bytesense` never requires a compiler; Rust only accelerates hot paths when a wheel matches your platform.
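
The first bullet can be illustrated with a hypothetical `sniff_layout` helper (a minimal sketch of the idea, not bytesense's actual rules):

```python
import codecs
from typing import Optional

def sniff_layout(data: bytes) -> Optional[str]:
    """Rule out (or lock in) whole encoding families without decoding."""
    # Check 4-byte BOMs before 2-byte ones: BOM_UTF32_LE begins with BOM_UTF16_LE.
    if data.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf_32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf_16"
    # ASCII-range text stored as UTF-16LE has a null byte in every odd position.
    if len(data) >= 4 and data[1::2] == b"\x00" * (len(data) // 2):
        return "utf_16_le"
    return None

print(sniff_layout("hi".encode("utf-16")))  # utf_16
```

Whatever this kind of pre-check cannot settle falls through to the statistical pipeline.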

## How it works (short)

1. **Fingerprint** the byte distribution and compare to pre-computed vectors.
2. **Decode** only the shortlisted encodings (strict), in a controlled order.
3. **Mess** — score how “garbled” the decoded text looks (printable ratio, bigrams, etc.).
4. **Coherence** — score language plausibility using character-frequency priors.
5. **Rank** and return the best hypothesis plus a human-readable **why**.
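
Step 1 can be sketched in a few lines (a toy illustration; the real fingerprint tables come from `scripts/build_fingerprints.py`):

```python
import math
from collections import Counter

def fingerprint(data: bytes) -> list:
    """Normalised 256-bin byte histogram: a cheap, decode-free signature."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def cosine(a, b) -> float:
    """Cosine similarity between two fingerprints (1.0 = identical shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# A candidate encoding is shortlisted when its stored fingerprint scores high
# against the input's fingerprint; only those candidates get fully decoded.
same = cosine(fingerprint(b"hello world"), fingerprint(b"world hello"))
print(round(same, 3))  # 1.0 — same byte multiset, same histogram
```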

## Known limitations

- Very short inputs (dozens of bytes) are inherently ambiguous; any detector will guess.
- Mixed-language text can confuse the language-coherence scoring.
- Like any heuristic detector, adversarial or random binary data may yield a best-effort encoding with low confidence.

## Contributing

Issues and PRs are welcome: [CONTRIBUTING.md](CONTRIBUTING.md) · [Issues](https://github.com/oguzhankir/bytesense/issues)

## License

MIT — see [LICENSE](LICENSE).

Copyright © Oğuzhan Kır.