jtoken 0.2.0__tar.gz → 0.2.2__tar.gz

jtoken-0.2.2/PKG-INFO ADDED
@@ -0,0 +1,285 @@
1
+ Metadata-Version: 2.4
2
+ Name: jtoken
3
+ Version: 0.2.2
4
+ Summary: Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement
5
+ Project-URL: Homepage, https://github.com/hermannsamimi/jtoken
6
+ Project-URL: Repository, https://github.com/hermannsamimi/jtoken
7
+ Project-URL: Issues, https://github.com/hermannsamimi/jtoken/issues
8
+ Author-email: Hermann Samimi <hermannsamimi@gmail.com>
9
+ License-Expression: MIT
10
+ License-File: LICENSE
11
+ Keywords: encoding,format,key-value,llm,serialization,text,tokens
12
+ Classifier: Development Status :: 3 - Alpha
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.8
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
22
+ Classifier: Topic :: Text Processing :: General
23
+ Requires-Python: >=3.8
24
+ Provides-Extra: dev
25
+ Requires-Dist: build>=1.0; extra == 'dev'
26
+ Requires-Dist: pytest-cov>=4.0; extra == 'dev'
27
+ Requires-Dist: pytest>=7.0; extra == 'dev'
28
+ Requires-Dist: tiktoken>=0.5; extra == 'dev'
29
+ Provides-Extra: tiktoken
30
+ Requires-Dist: tiktoken>=0.5; extra == 'tiktoken'
31
+ Description-Content-Type: text/markdown
32
+
33
+ # jtoken
34
+
35
+ **Author:** Hermann Samimi
36
+
37
+ **jtoken** compresses JSON-shaped documents for LLM prompts: fewer tokens, readable line-oriented output, and a lossless round trip for nested dicts of supported scalar types. It includes normalization for Elasticsearch hits and MongoDB JSON, a CLI, and token-measurement helpers.
38
+
39
+ Python 3.8+.
40
+
41
+ ## Installation
42
+
43
+ ### Core (no extra runtime dependencies)
44
+
45
+ ```bash
46
+ pip install jtoken
47
+ ```
48
+
49
+ ### With accurate OpenAI-style token counting
50
+
51
+ ```bash
52
+ pip install "jtoken[tiktoken]"
53
+ ```
54
+
55
+ The core package uses only the Python standard library. Install the `tiktoken` extra when you want tokenizer-accurate counts for OpenAI-compatible models.
56
+
57
+ ## Quick start
58
+
59
+ ```python
60
+ import jtoken
61
+
62
+ data = {
63
+ "user": "alice",
64
+ "age": 30,
65
+ "premium": True,
66
+ "verified": True,
67
+ "is_remote": False,
68
+ "trial": False,
69
+ "score": 9.5,
70
+ "referral": None,
71
+ "last_login": None,
72
+ }
73
+
74
+ text = jtoken.encode(data)
75
+ restored = jtoken.decode(text)
76
+ assert restored == data
77
+ ```
78
+
79
+ Aliases: `jtoken.dumps` = `encode`, `jtoken.loads` = `decode`.
80
+
81
+ ## End-to-end document workflow
82
+
83
+ ```python
84
+ import jtoken
85
+
86
+ raw = open("hit.json", encoding="utf-8").read()
87
+ text, context = jtoken.encode_document(raw, source="elastic_hit")
88
+ restored = jtoken.decode_document(text, target="elastic_hit", context=context)
89
+ ```
90
+
91
+ Keep the normalization context sidecar when you need a lossless decode back into Mongo shell, Extended JSON, or an Elasticsearch hit envelope.
92
+
93
+ ## Format overview
94
+
95
+ **JSON**
96
+
97
+ ```json
98
+ {"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
99
+ ```
100
+
101
+ **jtoken**
102
+
103
+ ```text
104
+ name: Alice
105
+ age: 30
106
+ trues: active
107
+ falses: verified
108
+ nulls: ref
109
+ ```
110
+
111
+ ### Encoding rules
112
+
113
+ - Nested dicts flatten with dot notation.
114
+ - `True`, `False`, and `None` collapse into `trues:`, `falses:`, and `nulls:` summary lines.
115
+ - Ambiguous strings keep quotes on encode.
116
+ - Multiline strings are JSON-quoted on one line.
117
+ - Keys containing `.` are escaped during normalization and restored from context.
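
The first two rules can be illustrated with a minimal standard-library sketch. This is not jtoken's implementation: `flatten` and `sketch_encode` are hypothetical names, the comma separator used when several keys share a summary line is an assumption, and string quoting and key escaping are left out.

```python
def flatten(data, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def sketch_encode(data):
    """Emit 'key: value' lines, then trues/falses/nulls summary lines."""
    lines, trues, falses, nulls = [], [], [], []
    for path, value in flatten(data).items():
        # Identity checks keep 1 and 0 from being treated as booleans.
        if value is True:
            trues.append(path)
        elif value is False:
            falses.append(path)
        elif value is None:
            nulls.append(path)
        else:
            lines.append(f"{path}: {value}")
    for label, paths in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if paths:
            # Assumption: multiple keys are joined with ", ".
            lines.append(f"{label}: {', '.join(paths)}")
    return "\n".join(lines)


doc = {"name": "Alice", "age": 30, "active": True, "verified": False, "ref": None}
print(sketch_encode(doc))
```

For the single-key case the sketch reproduces the five-line jtoken sample from the format overview.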
118
+
119
+ ### Supported scalar types
120
+
121
+ Scalars `str`, `int`, `float`, `bool`, and `None`, plus nested `dict` containers.
122
+
123
+ ### Limitations
124
+
125
+ - Keys cannot contain `": "` in the core codec.
126
+ - Reserved top-level keys: `nulls`, `trues`, `falses`.
127
+ - Lists are normalized into nested dicts with numeric keys before encoding.
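
The list rule can be pictured with the following sketch, which rewrites lists as dicts with numeric string keys and records which dotted paths were lists so they can be rebuilt later. The helper names are hypothetical and the handling of edge cases (for example a top-level list) is an assumption, not the library's documented behavior.

```python
def lists_to_dicts(value, path="", list_paths=None):
    """Replace every list with a dict keyed by '0', '1', ... and record its path."""
    if list_paths is None:
        list_paths = set()
    if isinstance(value, list):
        list_paths.add(path)
        value = {str(i): item for i, item in enumerate(value)}
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            out[key], _ = lists_to_dicts(child, child_path, list_paths)
        return out, list_paths
    return value, list_paths


def dicts_to_lists(value, list_paths, path=""):
    """Inverse of lists_to_dicts, driven by the recorded list paths."""
    if not isinstance(value, dict):
        return value
    restored = {
        key: dicts_to_lists(child, list_paths, f"{path}.{key}" if path else str(key))
        for key, child in value.items()
    }
    if path in list_paths:
        return [restored[str(i)] for i in range(len(restored))]
    return restored


doc = {"tags": ["a", "b"], "user": {"roles": [{"name": "admin"}]}}
normalized, paths = lists_to_dicts(doc)
print(normalized, sorted(paths))
```

Recording the paths separately is what makes the round trip lossless: without them, a dict that genuinely has the keys `"0"` and `"1"` would be indistinguishable from a former list.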
128
+
129
+ ## Input and output formats
130
+
131
+ Use `source=` / `target=` in Python or `--input-format` / `--output-format` on the CLI. `encode`, `stats`, and `count` accept `--input-format` (default `auto`). `decode` accepts `--output-format` (default `json`).
132
+
133
+ | Input (`source` / `--input-format`) | Use when |
134
+ |---|---|
135
+ | `auto` | Let jtoken detect the dialect from the text or object shape |
136
+ | `json` | Standard JSON object |
137
+ | `python` | Same JSON parser as `json` |
138
+ | `mongo_extended` | MongoDB Extended JSON with `$oid`, `$date`, `$numberInt`, `$numberLong`, `$numberDouble`, `$numberDecimal` |
139
+ | `mongo_shell` | MongoDB shell document with `ObjectId()`, `ISODate()`, `NumberInt()`, `NumberLong()` |
140
+ | `elastic_hit` | Elasticsearch search hit with `_source` (and optional `fields`) |
141
+ | `elastic_source` | `_source` payload only, or a document wrapped as `{"_source": {...}}` |
142
+
143
+ | Output (`target` / `--output-format`) | Use when |
144
+ |---|---|
145
+ | `python` | Python `repr` (Python API default) |
146
+ | `json` | Pretty-printed JSON (CLI `decode` default) |
147
+ | `mongo_extended` | Extended JSON; requires a context sidecar for BSON-like types |
148
+ | `mongo_shell` | Mongo shell document; requires a context sidecar for BSON-like types |
149
+ | `elastic_hit` | Full Elasticsearch hit envelope; requires a context sidecar |
150
+ | `elastic_source` | JSON shaped like an Elasticsearch `_source` wrapper |
151
+
152
+ With `auto`, jtoken picks `mongo_shell` when it sees `ObjectId(...)` or `ISODate(...)`, `elastic_hit` when the object has a dict `_source`, `mongo_extended` when Extended JSON markers such as `$oid` or `$date` appear, and otherwise `json`.
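
A rough sketch of that detection order, using only the standard library. `detect_dialect` and `has_marker` are hypothetical names, and the precedence between the checks simply follows the order of the sentence above; the real detector may differ in details such as how it treats shell literals inside string values.

```python
import json
import re

# Extended JSON markers listed in the input-format table.
MARKERS = {"$oid", "$date", "$numberInt", "$numberLong", "$numberDouble", "$numberDecimal"}


def has_marker(value):
    """Return True if any dict key anywhere in the value is an Extended JSON marker."""
    if isinstance(value, dict):
        return any(key in MARKERS or has_marker(child) for key, child in value.items())
    if isinstance(value, list):
        return any(has_marker(item) for item in value)
    return False


def detect_dialect(text):
    """Guess the input dialect of a raw document string."""
    if re.search(r"\b(?:ObjectId|ISODate)\(", text):
        return "mongo_shell"
    obj = json.loads(text)
    if isinstance(obj, dict) and isinstance(obj.get("_source"), dict):
        return "elastic_hit"
    if has_marker(obj):
        return "mongo_extended"
    return "json"


print(detect_dialect('{"_source": {"title": "hello"}}'))
```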
153
+
154
+ Write the normalization context to a sidecar on encode (`--context-out` / `NormalizationContext.to_dict()`) and pass it back on decode when the output dialect is not plain JSON or Python. The sidecar records list paths, dotted keys, Elasticsearch envelope metadata, and MongoDB type markers in `typed_values` (`object_id`, `datetime`, `long`).
155
+
156
+ ### MongoDB shell and Extended JSON
157
+
158
+ Mongo shell input is parsed as JSON after rewriting shell literals: `ObjectId("...")` and `ISODate("...")` become Extended JSON, `NumberInt(n)` becomes a plain integer, and `NumberLong(n)` becomes `{"$numberLong": "n"}`. On normalize, `object_id`, `datetime`, and `long` values are stored in the context so `mongo_extended` and `mongo_shell` output can restore `{"$oid": ...}` / `ObjectId(...)`, `{"$date": ...}` / `ISODate(...)`, and `{"$numberLong": ...}` / `NumberLong(...)`. `$numberInt`, `$numberDouble`, and `$numberDecimal` are coerced to Python scalars and are not tracked in `typed_values`.
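
The literal rewriting described above could look roughly like this. It is a sketch under stated assumptions: `shell_to_extended_json` is a hypothetical helper, the regular expressions cover only the four literals named in the paragraph, and the input is assumed to use quoted keys so that the result is valid JSON.

```python
import json
import re


def shell_to_extended_json(text):
    """Rewrite Mongo shell literals so the text can be parsed with json.loads."""
    text = re.sub(r'ObjectId\("([^"]*)"\)', r'{"$oid": "\1"}', text)
    text = re.sub(r'ISODate\("([^"]*)"\)', r'{"$date": "\1"}', text)
    # NumberInt(n) becomes a plain integer.
    text = re.sub(r'NumberInt\((-?\d+)\)', r'\1', text)
    # NumberLong(n) keeps its value as a string to avoid precision loss.
    text = re.sub(r'NumberLong\("?(-?\d+)"?\)', r'{"$numberLong": "\1"}', text)
    return text


shell_doc = (
    '{"_id": ObjectId("507f1f77bcf86cd799439011"), '
    '"created": ISODate("2024-01-01T00:00:00Z"), '
    '"n": NumberInt(5), "big": NumberLong(9007199254740993)}'
)
print(json.loads(shell_to_extended_json(shell_doc)))
```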
159
+
160
+ ### Elasticsearch hits
161
+
162
+ `elastic_hit` encodes the merged `_source` document (plus any `fields` values that are not already present in `_source`) and stores `_index`, `_id`, `_version`, `_score`, `_type`, and `_routing` in the context for lossless `elastic_hit` output.
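
As an illustration of that split, the sketch below separates a hit into the document to encode and the envelope metadata to keep in the context. `split_hit` and `rebuild_hit` are hypothetical helpers; the merge rule (a `fields` value is used only when the key is missing from `_source`) follows the sentence above, and everything else is an assumption.

```python
ENVELOPE_KEYS = ("_index", "_id", "_version", "_score", "_type", "_routing")


def split_hit(hit):
    """Return (document, envelope) for an Elasticsearch search hit."""
    doc = dict(hit.get("_source", {}))
    for key, value in hit.get("fields", {}).items():
        # _source wins when a key appears in both places.
        doc.setdefault(key, value)
    envelope = {key: hit[key] for key in ENVELOPE_KEYS if key in hit}
    return doc, envelope


def rebuild_hit(doc, envelope):
    """Wrap a decoded document back into a hit-shaped dict."""
    hit = dict(envelope)
    hit["_source"] = doc
    return hit


hit = {
    "_index": "users",
    "_id": "42",
    "_score": 1.0,
    "_source": {"name": "Alice"},
    "fields": {"name": ["ignored"], "age": [30]},
}
doc, envelope = split_hit(hit)
print(doc, envelope)
```

This simplified rebuild places merged `fields` values under `_source`; restoring a separate `fields` block would need additional bookkeeping that the sketch does not attempt.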
163
+
164
+ ## Public API reference
165
+
166
+ ### Package metadata
167
+
168
+ | name | type | description |
169
+ |---|---|---|
170
+ | `jtoken.__version__` | `str` | package version |
171
+ | `jtoken.__author__` | `str` | author name (`Hermann Samimi`) |
172
+
173
+ ### Core codec
174
+
175
+ | function | signature | description |
176
+ |---|---|---|
177
+ | `encode` | `encode(data: dict) -> str` | compress a nested scalar dict into jtoken text |
178
+ | `decode` | `decode(text: str) -> dict` | reconstruct the nested dict |
179
+ | `dumps` | alias of `encode` | json-style alias |
180
+ | `loads` | alias of `decode` | json-style alias |
181
+
182
+ ### Normalization and denormalization
183
+
184
+ | function | signature | description |
185
+ |---|---|---|
186
+ | `parse_input` | `parse_input(text, *, source="auto")` | parse foreign text into Python data |
187
+ | `normalize` | `normalize(data, *, source="auto", context=None)` | return `(normalized_dict, NormalizationContext)` |
188
+ | `denormalize` | `denormalize(data, *, target="python", context)` | restore lists, typed values, and dialect shape |
189
+ | `render_output` | `render_output(value, *, target="python") -> str` | render denormalized data as text |
190
+ | `encode_document` | `encode_document(raw, *, source="auto", context=None)` | return `(jtoken_text, NormalizationContext)` |
191
+ | `decode_document` | `decode_document(text, *, target="python", context)` | decode jtoken text and denormalize |
192
+
193
+ ### Token measurement
194
+
195
+ | function | signature | description |
196
+ |---|---|---|
197
+ | `count_tokens` | `count_tokens(data, *, model="cl100k_base", backend="auto") -> int` | count tokens for a dict or encoded jtoken string |
198
+ | `count_text_tokens` | `count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int` | count tokens for raw text |
199
+ | `token_savings` | `token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2)` | compare jtoken vs pretty JSON token usage |
200
+
201
+ ### `TokenSavings` properties
202
+
203
+ | property | type | description |
204
+ |---|---|---|
205
+ | `jtoken_tokens` | `int` | token count for the jtoken representation |
206
+ | `json_tokens` | `int` | token count for the JSON baseline |
207
+ | `saved` | `int` | `json_tokens - jtoken_tokens` |
208
+ | `percent` | `float` | percent saved relative to JSON |
209
+
210
+ `str(stats)` prints a one-line summary.
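
The arithmetic behind these properties is small enough to sketch. `SavingsSketch` is a hypothetical stand-in, not the library's `TokenSavings` class; the summary format mirrors the sample output shown in the project README, and rounding to one decimal place is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsSketch:
    jtoken_tokens: int
    json_tokens: int

    @property
    def saved(self):
        # Tokens saved by the jtoken representation.
        return self.json_tokens - self.jtoken_tokens

    @property
    def percent(self):
        # Savings relative to the JSON baseline; guard against an empty baseline.
        if self.json_tokens == 0:
            return 0.0
        return 100.0 * self.saved / self.json_tokens

    def __str__(self):
        return (
            f"jtoken: {self.jtoken_tokens} tokens | json: {self.json_tokens} tokens"
            f" | saved: {self.saved} ({self.percent:.1f}%)"
        )


print(SavingsSketch(jtoken_tokens=22, json_tokens=36))
```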
211
+
212
+ ### `NormalizationContext` fields
213
+
214
+ | field | type | description |
215
+ |---|---|---|
216
+ | `source_format` | `str` | input dialect used during normalization |
217
+ | `target_format` | `str \| None` | optional output hint |
218
+ | `typed_values` | `dict[str, str]` | dotted paths with BSON-like type markers |
219
+ | `lists` | `set[str]` | dotted paths that were lists before flattening |
220
+ | `dotted_keys` | `dict[str, str]` | escaped keys that originally contained `.` |
221
+ | `elastic` | `dict \| None` | Elasticsearch envelope metadata |
222
+
223
+ Methods: `to_dict()`, `from_dict(data)`.
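
To show why a `to_dict()` / `from_dict()` pair matters for a JSON sidecar, here is a stand-in dataclass with the same field names. `ContextSketch` is hypothetical; in particular, serializing `lists` as a sorted list is an assumption made because JSON has no set type, and the real `NormalizationContext` may serialize differently.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ContextSketch:
    source_format: str = "auto"
    target_format: Optional[str] = None
    typed_values: Dict[str, str] = field(default_factory=dict)
    lists: Set[str] = field(default_factory=set)
    dotted_keys: Dict[str, str] = field(default_factory=dict)
    elastic: Optional[dict] = None

    def to_dict(self):
        """Return a JSON-serializable dict (the set becomes a sorted list)."""
        return {
            "source_format": self.source_format,
            "target_format": self.target_format,
            "typed_values": dict(self.typed_values),
            "lists": sorted(self.lists),
            "dotted_keys": dict(self.dotted_keys),
            "elastic": self.elastic,
        }

    @classmethod
    def from_dict(cls, data):
        """Rebuild a context from the output of to_dict()."""
        data = dict(data)
        data["lists"] = set(data.get("lists", ()))
        return cls(**data)


ctx = ContextSketch(source_format="mongo_shell", typed_values={"_id": "object_id"}, lists={"tags"})
sidecar_text = json.dumps(ctx.to_dict())
print(ContextSketch.from_dict(json.loads(sidecar_text)) == ctx)
```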
224
+
225
+ ### Format enums
226
+
227
+ `InputFormat`: `auto`, `json`, `python`, `mongo_extended`, `mongo_shell`, `elastic_hit`, `elastic_source`
228
+
229
+ `OutputFormat`: `python`, `json`, `mongo_extended`, `mongo_shell`, `elastic_hit`, `elastic_source`
230
+
231
+ ### Exceptions
232
+
233
+ | exception | base | when raised |
234
+ |---|---|---|
235
+ | `JPackError` | `Exception` | base library error |
236
+ | `JPackEncodeError` | `JPackError` | encoding fails |
237
+ | `JPackDecodeError` | `JPackError` | decoding fails |
238
+ | `NormalizationError` | `JPackError` | normalization fails |
239
+ | `DenormalizationError` | `JPackError` | denormalization fails |
240
+ | `TokenCountError` | `JPackError` | token counting fails |
241
+
242
+ ## Token counting
243
+
244
+ ```python
245
+ stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
246
+ print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)
247
+ ```
248
+
249
+ | `backend` | behavior |
250
+ |---|---|
251
+ | `auto` | use `tiktoken` when installed, otherwise estimate |
252
+ | `tiktoken` | require `tiktoken` |
253
+ | `estimate` | simple character heuristic |
254
+
255
+ `json_indent=2` compares against prompt-style pretty JSON. Use `json_indent=None` for compact JSON.
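
The effect of the baseline choice can be seen with a toy character heuristic. The four-characters-per-token ratio and the name `estimate_tokens` are assumptions for illustration only; the source does not specify which heuristic the `estimate` backend uses.

```python
import json
import math


def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate: one token per chars_per_token characters, rounded up."""
    return math.ceil(len(text) / chars_per_token)


data = {"name": "Alice", "age": 30, "active": True}
pretty = json.dumps(data, indent=2)      # prompt-style baseline
compact = json.dumps(data, indent=None)  # single-line baseline

print(estimate_tokens(pretty), estimate_tokens(compact))
```

Because indentation and newlines add characters, the pretty baseline is larger, so reported savings against it are higher than against compact JSON.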
256
+
257
+ ## CLI
258
+
259
+ ```bash
260
+ jtoken encode --input-format mongo_shell -f doc.json --context-out doc.ctx.json
261
+ jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
262
+ jtoken stats --input-format json -f doc.json --model gpt-4o --backend tiktoken
263
+ jtoken count --input-format json -f doc.json --backend estimate
264
+ python -m jtoken encode
265
+ ```
266
+
267
+ Common flags:
268
+
269
+ - `-f/--file`
270
+ - `--input-format`
271
+ - `--output-format`
272
+ - `--context-out`
273
+ - `--context-in`
274
+ - `--model`
275
+ - `--backend`
276
+
277
+ ## Links
278
+
279
+ - Homepage: https://github.com/hermannsamimi/jtoken
280
+ - Repository: https://github.com/hermannsamimi/jtoken
281
+ - Issues: https://github.com/hermannsamimi/jtoken/issues
282
+
283
+ ## License
284
+
285
+ MIT — Copyright (c) 2026 Hermann Samimi
jtoken-0.2.2/README.md ADDED
@@ -0,0 +1,224 @@
1
+ # jtoken
2
+
3
+ Compress JSON for LLM prompts — same data, fewer tokens.
4
+
5
+ **Author:** Hermann Samimi
6
+
7
+ jtoken strips JSON syntactic noise, collapses repeated booleans and nulls into summary lines, flattens nested dicts with dot notation, and supports normalization for Elasticsearch hits and MongoDB JSON. The package ships as a stdlib-first library with an optional `tiktoken` extra and a `jtoken` CLI.
8
+
9
+ ## Installation
10
+
11
+ ### Core
12
+
13
+ ```bash
14
+ pip install jtoken
15
+ ```
16
+
17
+ No extra runtime dependencies.
18
+
19
+ ### With tokenizer-accurate counting
20
+
21
+ ```bash
22
+ pip install "jtoken[tiktoken]"
23
+ ```
24
+
25
+ Use the `tiktoken` extra when you want OpenAI-compatible token counts instead of the built-in estimate backend.
26
+
27
+ ## Quick start
28
+
29
+ ```python
30
+ import jtoken
31
+
32
+ data = {
33
+ "user": "alice",
34
+ "age": 30,
35
+ "premium": True,
36
+ "verified": True,
37
+ "is_remote": False,
38
+ "trial": False,
39
+ "score": 9.5,
40
+ "referral": None,
41
+ "last_login": None,
42
+ }
43
+
44
+ text = jtoken.encode(data)
45
+ original = jtoken.decode(text)
46
+ assert original == data
47
+ ```
48
+
49
+ `dumps` / `loads` are json-style aliases for `encode` / `decode`.
50
+
51
+ ## What the format looks like
52
+
53
+ **JSON**
54
+
55
+ ```json
56
+ {"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
57
+ ```
58
+
59
+ **jtoken**
60
+
61
+ ```text
62
+ name: Alice
63
+ age: 30
64
+ trues: active
65
+ falses: verified
66
+ nulls: ref
67
+ ```
68
+
69
+ Nested dicts flatten with dot notation. Booleans and nulls at any depth collapse into the same summary lines. Decode reconstructs the original nested structure.
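
The reconstruction step can be sketched as the inverse of dot-notation flattening. `unflatten` is a hypothetical helper, and it assumes keys contain no literal `.` characters (which the library handles separately through its normalization context).

```python
def unflatten(flat):
    """Rebuild nested dicts from a mapping of dotted paths to values."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            # Create intermediate dicts on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {"user.name": "Alice", "user.address.city": "Paris", "age": 30}
print(unflatten(flat))
```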
70
+
71
+ ## Normalization and denormalization
72
+
73
+ Foreign document shapes can be normalized before encoding and restored after decode with a sidecar context.
74
+
75
+ ```python
76
+ import jtoken
77
+
78
+ raw_hit = {...}
79
+ normalized, context = jtoken.normalize(raw_hit, source="elastic_hit")
80
+ text = jtoken.encode(normalized)
81
+ restored = jtoken.denormalize(
82
+ jtoken.decode(text),
83
+ target="elastic_hit",
84
+ context=context,
85
+ )
86
+ ```
87
+
88
+ ```bash
89
+ jtoken encode --input-format elastic_hit -f hit.json --context-out hit.ctx.json
90
+ jtoken decode --output-format elastic_hit -f hit.jtoken --context-in hit.ctx.json
91
+ ```
92
+
93
+ ### Input and output formats
94
+
95
+ Use `source=` / `target=` in Python or `--input-format` / `--output-format` on the CLI. `encode`, `stats`, and `count` accept `--input-format` (default `auto`). `decode` accepts `--output-format` (default `json`).
96
+
97
+ | Input (`source` / `--input-format`) | Use when |
98
+ |---|---|
99
+ | `auto` | Let jtoken detect the dialect from the text or object shape |
100
+ | `json` | Standard JSON object |
101
+ | `python` | Same JSON parser as `json` |
102
+ | `mongo_extended` | MongoDB Extended JSON with `$oid`, `$date`, `$numberInt`, `$numberLong`, `$numberDouble`, `$numberDecimal` |
103
+ | `mongo_shell` | MongoDB shell document with `ObjectId()`, `ISODate()`, `NumberInt()`, `NumberLong()` |
104
+ | `elastic_hit` | Elasticsearch search hit with `_source` (and optional `fields`) |
105
+ | `elastic_source` | `_source` payload only, or a document wrapped as `{"_source": {...}}` |
106
+
107
+ | Output (`target` / `--output-format`) | Use when |
108
+ |---|---|
109
+ | `python` | Python `repr` (Python API default) |
110
+ | `json` | Pretty-printed JSON (CLI `decode` default) |
111
+ | `mongo_extended` | Extended JSON; requires a context sidecar for BSON-like types |
112
+ | `mongo_shell` | Mongo shell document; requires a context sidecar for BSON-like types |
113
+ | `elastic_hit` | Full Elasticsearch hit envelope; requires a context sidecar |
114
+ | `elastic_source` | JSON shaped like an Elasticsearch `_source` wrapper |
115
+
116
+ With `auto`, jtoken picks `mongo_shell` when it sees `ObjectId(...)` or `ISODate(...)`, `elastic_hit` when the object has a dict `_source`, `mongo_extended` when Extended JSON markers such as `$oid` or `$date` appear, and otherwise `json`.
117
+
118
+ Write the normalization context to a sidecar on encode (`--context-out` / `NormalizationContext.to_dict()`) and pass it back on decode when the output dialect is not plain JSON or Python. The sidecar records list paths, dotted keys, Elasticsearch envelope metadata, and MongoDB type markers in `typed_values` (`object_id`, `datetime`, `long`).
119
+
120
+ ### MongoDB shell and Extended JSON
121
+
122
+ Mongo shell input is parsed as JSON after rewriting shell literals: `ObjectId("...")` and `ISODate("...")` become Extended JSON, `NumberInt(n)` becomes a plain integer, and `NumberLong(n)` becomes `{"$numberLong": "n"}`. On normalize, `object_id`, `datetime`, and `long` values are stored in the context so `mongo_extended` and `mongo_shell` output can restore `{"$oid": ...}` / `ObjectId(...)`, `{"$date": ...}` / `ISODate(...)`, and `{"$numberLong": ...}` / `NumberLong(...)`. `$numberInt`, `$numberDouble`, and `$numberDecimal` are coerced to Python scalars and are not tracked in `typed_values`.
123
+
124
+ ### Elasticsearch hits
125
+
126
+ `elastic_hit` encodes the merged `_source` document (plus any `fields` values that are not already present in `_source`) and stores `_index`, `_id`, `_version`, `_score`, `_type`, and `_routing` in the context for lossless `elastic_hit` output.
127
+
128
+ ## CLI
129
+
130
+ ```bash
131
+ echo '{"name": "Alice", "active": true}' | jtoken encode
132
+ printf 'name: Alice\ntrues: active\n' | jtoken decode
133
+ echo '{"name": "Alice", "active": true}' | jtoken stats
134
+ echo '{"name": "Alice", "active": true}' | jtoken count
135
+ ```
136
+
137
+ Use `-f/--file` for file input. `encode`, `stats`, and `count` accept `--input-format`. `decode` accepts `--output-format` and `--context-in` when restoring non-JSON dialects. `stats` and `count` accept `--model` and `--backend`.
138
+
139
+ ## Token savings
140
+
141
+ ```python
142
+ import jtoken
143
+
144
+ stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
145
+ print(stats)
146
+ # jtoken: 22 tokens | json: 36 tokens | saved: 14 (38.9%)
147
+
148
+ print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)
149
+ ```
150
+
151
+ `count_tokens` and `count_text_tokens` are also available. Savings compare the jtoken representation against pretty JSON by default (`json_indent=2`).
152
+
153
+ ## API reference
154
+
155
+ ### Package metadata
156
+
157
+ - `jtoken.__version__`
158
+ - `jtoken.__author__`
159
+
160
+ ### Core codec
161
+
162
+ - `encode(data: dict) -> str`
163
+ - `decode(text: str) -> dict`
164
+ - `dumps` / `loads`
165
+
166
+ ### Normalization
167
+
168
+ - `parse_input(text, *, source="auto")`
169
+ - `normalize(data, *, source="auto", context=None) -> tuple[dict, NormalizationContext]`
170
+ - `denormalize(data, *, target="python", context)`
171
+ - `render_output(value, *, target="python") -> str`
172
+ - `encode_document(raw, *, source="auto", context=None) -> tuple[str, NormalizationContext]`
173
+ - `decode_document(text, *, target="python", context)`
174
+
175
+ ### Token helpers
176
+
177
+ - `count_tokens(data, *, model="cl100k_base", backend="auto") -> int`
178
+ - `count_text_tokens(text, *, model="cl100k_base", backend="auto") -> int`
179
+ - `token_savings(data, *, model="cl100k_base", backend="auto", json_indent=2) -> TokenSavings`
180
+
181
+ ### `TokenSavings`
182
+
183
+ - `jtoken_tokens`
184
+ - `json_tokens`
185
+ - `saved`
186
+ - `percent`
187
+
188
+ ### `NormalizationContext`
189
+
190
+ - `source_format`
191
+ - `target_format`
192
+ - `typed_values`
193
+ - `lists`
194
+ - `dotted_keys`
195
+ - `elastic`
196
+ - `to_dict()` / `from_dict()`
197
+
198
+ ### Format enums
199
+
200
+ - `InputFormat`
201
+ - `OutputFormat`
202
+
203
+ ### Exceptions
204
+
205
+ - `JPackError`
206
+ - `JPackEncodeError`
207
+ - `JPackDecodeError`
208
+ - `NormalizationError`
209
+ - `DenormalizationError`
210
+ - `TokenCountError`
211
+
212
+ ## Development
213
+
214
+ ```bash
215
+ git clone https://github.com/hermannsamimi/jtoken
216
+ cd jtoken
217
+ pip install -e ".[dev]"
218
+ pytest
219
+ pytest --cov=jtoken --cov-report=term-missing
220
+ ```
221
+
222
+ ## License
223
+
224
+ MIT — © 2026 Hermann Samimi