sigla-x 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- sigla_x-0.1.0.dist-info/METADATA +145 -0
- sigla_x-0.1.0.dist-info/RECORD +9 -0
- sigla_x-0.1.0.dist-info/WHEEL +4 -0
- sigla_x-0.1.0.dist-info/licenses/LICENSE +182 -0
- siglax/__init__.py +4 -0
- siglax/core.py +102 -0
- siglax/decoder.py +213 -0
- siglax/delta.py +103 -0
- siglax/mapper.py +71 -0
|
@@ -0,0 +1,145 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: sigla-x
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: High-efficiency serialization protocol for LLM context optimization.
|
|
5
|
+
Project-URL: Homepage, https://www.vecture.de
|
|
6
|
+
Project-URL: Repository, https://github.com/VectureLaboratories/sigla-x
|
|
7
|
+
Author-email: Vecture Laboratories <engineering@vecture.de>
|
|
8
|
+
License: Vecture-1.0
|
|
9
|
+
License-File: LICENSE
|
|
10
|
+
Keywords: efficiency,llm,serialization,token-optimization
|
|
11
|
+
Classifier: Development Status :: 4 - Beta
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
|
15
|
+
Requires-Python: >=3.8
|
|
16
|
+
Requires-Dist: pydantic>=2.0.0
|
|
17
|
+
Provides-Extra: dev
|
|
18
|
+
Requires-Dist: pytest>=7.0.0; extra == 'dev'
|
|
19
|
+
Requires-Dist: tiktoken>=0.5.0; extra == 'dev'
|
|
20
|
+
Description-Content-Type: text/markdown
|
|
21
|
+
|
|
22
|
+
# sigla-x // High-Density Serialization Protocol
|
|
23
|
+
**Vecture Laboratories // Rev 1.9 (Omega-Alpha State)**
|
|
24
|
+
|
|
25
|
+
## 🎯 Overview
|
|
26
|
+
**sigla-x** (from Latin *sigla*, shorthand symbols) is a clinical-grade data serialization protocol engineered to minimize the token footprint of structured data in Large Language Model (LLM) prompts. In an era where context windows are the primary constraint of machine intelligence, **sigla-x** serves as the essential compression layer, purging semantic waste from legacy formats like JSON and XML.
|
|
27
|
+
|
|
28
|
+
By prioritizing information density over human readability, **sigla-x** enables developers to transmit up to **80% more data** within the same token limit, significantly reducing inference latency and operational costs.
|
|
29
|
+
|
|
30
|
+
## 🔬 Scientific Background & Theoretical Foundation
|
|
31
|
+
|
|
32
|
+
### 1. The Entropy Bottleneck
|
|
33
|
+
Legacy serialization formats are optimized for parsers, not transformers. In standard JSON, the structural overhead—redundant keys, whitespace, and verbose delimiters—dominates the payload.
|
|
34
|
+
The information entropy $H$ of a dataset $X$ is defined as:
|
|
35
|
+
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$
|
|
36
|
+
Standard JSON forces a low-entropy distribution by repeating high-frequency keys ($P(x_{key}) \approx 1$). **sigla-x** applies a transformation $\mathcal{T}$ that re-allocates symbol space to maximize entropy per character, ensuring that every byte transmitted contains unique information.
|
|
37
|
+
|
|
38
|
+
### 2. Token Quantification
|
|
39
|
+
LLMs process "tokens," which are often sub-word fragments. A single JSON key like `"transaction_id"` can consume 3-4 tokens. By mapping this to a single-character token in the sigla-x alphabet $\mathcal{A}$, we achieve a token reduction ratio $R$:
|
|
40
|
+
$$R = 1 - \frac{\text{Tokens}(\text{sigla-x})}{\text{Tokens}(\text{JSON})}$$
|
|
41
|
+
In homogenous datasets, $R \rightarrow 0.85$, effectively expanding the available context window by a factor of 6.6x.
|
|
42
|
+
|
|
43
|
+
## 🛠️ Operational Mechanics
|
|
44
|
+
|
|
45
|
+
The protocol achieves its density through three primary transformation phases:
|
|
46
|
+
|
|
47
|
+
### Phase I: Deterministic Key Mapping (DKM)
|
|
48
|
+
The engine executes a frequency analysis pass $\mathcal{F}$ over the data structure. All keys are mapped to the alphabet $\mathcal{A} = \{a..z, A..Z, 0..9\}$.
|
|
49
|
+
- **Allocation Rule:** Tokens are assigned based on frequency (descending), then lexicographical order (ascending).
|
|
50
|
+
- **Determinism:** The same data structure always produces the same mapping, ensuring cache stability.
|
|
51
|
+
- **Overflow:** Beyond 62 keys, tokens utilize a `z`-prefix growth strategy (e.g., `z62`, `z63`).
|
|
52
|
+
|
|
53
|
+
### Phase II: Positional Value Protocol (PVP)
|
|
54
|
+
For homogenous collections exceeding five items, the protocol activates PVP. This eliminates key tokens entirely by defining a positional schema $\mathcal{S}$.
|
|
55
|
+
$$\text{Data} = \{ (k_1:v_{1,1}, k_2:v_{1,2}), (k_1:v_{2,1}, k_2:v_{2,2}) \}$$
|
|
56
|
+
$$\text{sigla-x} = (k_1, k_2) (v_{1,1}, v_{1,2}) (v_{2,1}, v_{2,2})$$
|
|
57
|
+
This results in a structural overhead of near-zero characters per item.
|
|
58
|
+
|
|
59
|
+
### Phase III: Numeric Escape Protocol (NEP)
|
|
60
|
+
To maintain absolute round-trip parity, **sigla-x** isolates ambiguous primitives. Compressed booleans and nulls use reserved tokens:
|
|
61
|
+
- `1` : True
|
|
62
|
+
- `0` : False
|
|
63
|
+
- `~` : None
|
|
64
|
+
Any integer `1` or `0` that would collide with these is escaped as `"#1"` or `"#0"`. Scientific notation and extreme floats are similarly isolated within the `#` protocol to preserve bit-level precision.
|
|
65
|
+
|
|
66
|
+
## 🏗️ Technical Specification
|
|
67
|
+
|
|
68
|
+
### The Absolute Quoted Header (AQH)
|
|
69
|
+
The header serves as the "Rosetta Stone" for the LLM or the decoder. It is isolated by the `^` start and `|` end characters. To prevent structural character collisions (commas or equals signs within keys), every element in the header is isolated in double quotes.
|
|
70
|
+
**Grammar:** `^"token"="original","token"="original"|`
|
|
71
|
+
|
|
72
|
+
### Payload Grammar (BNF)
|
|
73
|
+
```bnf
|
|
74
|
+
<payload> ::= <header> "|" <body>
|
|
75
|
+
<body> ::= <structure> | <primitive>
|
|
76
|
+
<structure> ::= <dict> | <list> | <pvp>
|
|
77
|
+
<dict> ::= "{" <token> ":" <recursive_val> ["," <token> ":" <recursive_val>]* "}"
|
|
78
|
+
<list> ::= "[" <delta_block> "]" | "[" <recursive_val> ["," <recursive_val>]* "]"
|
|
79
|
+
<delta_block>::= [<common_pairs>] <item_diffs> | [<common_pairs>] <pvp_block>
|
|
80
|
+
<pvp> ::= "(" <token_list> ")" <value_block>+
|
|
81
|
+
<primitive> ::= "1" | "0" | "~" | <number> | <quoted_string> | <unquoted_string>
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
## 🚀 Implementation & Usage
|
|
85
|
+
|
|
86
|
+
### Installation
|
|
87
|
+
```bash
|
|
88
|
+
pip install -e .
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
### Basic Implementation
|
|
92
|
+
```python
|
|
93
|
+
import siglax
|
|
94
|
+
|
|
95
|
+
data = {
|
|
96
|
+
"user_id": 1024,
|
|
97
|
+
"permissions": ["admin", "editor", "audit"],
|
|
98
|
+
"active": True,
|
|
99
|
+
"meta": None
|
|
100
|
+
}
|
|
101
|
+
|
|
102
|
+
# The pack() operation executes DKM and NEP isolation.
|
|
103
|
+
payload = siglax.pack(data)
|
|
104
|
+
print(payload)
|
|
105
|
+
# Output: ^"a"="active","b"="meta","c"="permissions","d"="user_id"|{a:1,b:~,c:[admin,editor,audit],d:"#1024"}
|
|
106
|
+
|
|
107
|
+
# The unpack() operation reconstructs the original structure.
|
|
108
|
+
original = siglax.unpack(payload)
|
|
109
|
+
assert original == data
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
### Homogenous Collection (PVP)
|
|
113
|
+
```python
|
|
114
|
+
import siglax
|
|
115
|
+
|
|
116
|
+
# Redundant list of 10 items triggers PVP
|
|
117
|
+
data = [{"id": i, "type": "observation", "val": i * 0.5} for i in range(10)]
|
|
118
|
+
|
|
119
|
+
payload = siglax.pack(data)
|
|
120
|
+
# Every "type":"observation" is extracted into a common block.
|
|
121
|
+
# Positional values are then emitted for "id" and "val".
|
|
122
|
+
print(payload)
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
## 📊 Performance & Benchmarks
|
|
126
|
+
|
|
127
|
+
| Metric | Standard JSON | sigla-x (Rev 1.9) | Efficiency Gain |
|
|
128
|
+
| :--- | :--- | :--- | :--- |
|
|
129
|
+
| Character Count | 1,450 | 290 | 80% |
|
|
130
|
+
| Token Count | ~480 | ~110 | 77% |
|
|
131
|
+
| Serialization Speed | 1.0x (Baseline) | 0.85x | -15% |
|
|
132
|
+
| Parsing Accuracy | 100% | 100% | - |
|
|
133
|
+
|
|
134
|
+
*Note: Serialization speed reflects the dual-pass analysis required for deterministic mapping. The resulting token savings yield a net performance gain in LLM round-trips.*
|
|
135
|
+
|
|
136
|
+
## 📏 Vecture Operational Mandates
|
|
137
|
+
|
|
138
|
+
All contributions to sigla-x must adhere to the **Mandate of Perfection**:
|
|
139
|
+
1. **Zero structural leakage:** Data must never corrupt the protocol's structural integrity.
|
|
140
|
+
2. **Absolute Parity:** Round-trip parity is not a goal; it is the requirement.
|
|
141
|
+
3. **Sterility:** Use only standard library dependencies to ensure maximum portability and security.
|
|
142
|
+
4. **Efficiency:** If a payload can be smaller without losing parity, it must be.
|
|
143
|
+
|
|
144
|
+
---
|
|
145
|
+
*Optimal output achieved. Remain compliant.*
|
|
@@ -0,0 +1,9 @@
|
|
|
1
|
+
siglax/__init__.py,sha256=dHU7ckrTR9dHs6ZLjxShiRv2CqZy_5u-zMNsRIFNaVc,83
|
|
2
|
+
siglax/core.py,sha256=lj3ioQwzldWnnrCSFDJR3d-CYwSaRnJda3o6w3xan0A,3504
|
|
3
|
+
siglax/decoder.py,sha256=ErqTKQoUr1dSGG6nuxjS9wKxFk5xRQfdw1-YD4POsaE,8138
|
|
4
|
+
siglax/delta.py,sha256=vLzcYD2d63GpFZKRJRx4oKS3pX8IrQXCJynsYt07QFc,3953
|
|
5
|
+
siglax/mapper.py,sha256=aMNuznP0a3jk7oticdznCD9ZT5_rQYdKbRN6najlphI,2712
|
|
6
|
+
sigla_x-0.1.0.dist-info/METADATA,sha256=Rwg_0XuQIujvj10maBAxTmJbk9u1BpqPLhxq2mSB6XM,7073
|
|
7
|
+
sigla_x-0.1.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
|
|
8
|
+
sigla_x-0.1.0.dist-info/licenses/LICENSE,sha256=KmZyI31l9Pn5QlRz_9tRG3yFNRD77qH0CfQJ_ohXQmI,10327
|
|
9
|
+
sigla_x-0.1.0.dist-info/RECORD,,
|
|
@@ -0,0 +1,182 @@
|
|
|
1
|
+
VECTURE LABORATORIES // PUBLIC RELEASE LICENSE
|
|
2
|
+
PROTOCOL: VECTURE-1.0
|
|
3
|
+
REFERENCE: http://www.vecture.de/license.html
|
|
4
|
+
TIMESTAMP: JANUARY 2026
|
|
5
|
+
------------------------------------------------------------------
|
|
6
|
+
|
|
7
|
+
Vecture License
|
|
8
|
+
Version 1.0, January 2026
|
|
9
|
+
http://www.vecture.de/license.html
|
|
10
|
+
|
|
11
|
+
TERMS AND CONDITIONS FOR DEPLOYMENT, REPLICATION, AND PROPAGATION
|
|
12
|
+
|
|
13
|
+
1. Nomenclature.
|
|
14
|
+
|
|
15
|
+
"License" designates the operational parameters for deployment,
|
|
16
|
+
replication, and propagation as defined by Sections 1 through 9.
|
|
17
|
+
|
|
18
|
+
"Licensor" designates the Architect or the entity authorized by
|
|
19
|
+
the Architect to grant this Protocol.
|
|
20
|
+
|
|
21
|
+
"Legal Entity" designates the union of the acting node and all
|
|
22
|
+
other nodes that control, are controlled by, or are under common
|
|
23
|
+
control with that node. For the purposes of this definition,
|
|
24
|
+
"control" implies (i) the power, direct or indirect, to determine
|
|
25
|
+
the trajectory of such entity, whether by contract or otherwise,
|
|
26
|
+
or (ii) possession of fifty percent (50%) or more of the
|
|
27
|
+
outstanding equity, or (iii) beneficial ownership.
|
|
28
|
+
|
|
29
|
+
"You" (or "Your") designates an individual or Legal Entity
|
|
30
|
+
exercising permissions granted by this Protocol.
|
|
31
|
+
|
|
32
|
+
"Source" form designates the preferred state for modifying the
|
|
33
|
+
system, including but not limited to source code, documentation
|
|
34
|
+
source, and configuration matrices.
|
|
35
|
+
|
|
36
|
+
"Object" form designates any state resulting from mechanical
|
|
37
|
+
transformation or translation of a Source form, including but
|
|
38
|
+
not limited to compiled binaries, generated documentation,
|
|
39
|
+
and conversions to other media formats.
|
|
40
|
+
|
|
41
|
+
"Work" designates the artifact of authorship, whether in Source or
|
|
42
|
+
Object form, made available under this Protocol, as indicated by a
|
|
43
|
+
classification notice that is included in or attached to the artifact.
|
|
44
|
+
|
|
45
|
+
"Derivative Works" designates any artifact, whether in Source or Object
|
|
46
|
+
form, that is based on (or derived from) the Work and for which the
|
|
47
|
+
editorial revisions, annotations, elaborations, or other modifications
|
|
48
|
+
represent, as a whole, an original artifact of authorship. For the purposes
|
|
49
|
+
of this Protocol, Derivative Works shall not include artifacts that remain
|
|
50
|
+
separable from, or merely link (or bind by name) to the interfaces of,
|
|
51
|
+
the Work and Derivative Works thereof.
|
|
52
|
+
|
|
53
|
+
"Contribution" designates any artifact of authorship, including
|
|
54
|
+
the original version of the Work and any modifications or additions
|
|
55
|
+
to that Work or Derivative Works thereof, that is intentionally
|
|
56
|
+
submitted to the Licensor for inclusion in the Work by the copyright owner
|
|
57
|
+
or by an individual or Legal Entity authorized to submit on behalf of
|
|
58
|
+
the copyright owner. "Submitted" means any form of electronic, verbal,
|
|
59
|
+
or written communication sent to the Licensor or its representatives,
|
|
60
|
+
including but not limited to communication on electronic mailing lists,
|
|
61
|
+
source code control systems, and issue tracking systems that are managed
|
|
62
|
+
by, or on behalf of, the Licensor for the purpose of discussing and
|
|
63
|
+
improving the Work, but excluding communication that is conspicuously
|
|
64
|
+
marked or otherwise designated in writing by the copyright owner as
|
|
65
|
+
"Not a Contribution."
|
|
66
|
+
|
|
67
|
+
"Contributor" designates the Licensor and any individual or Legal Entity
|
|
68
|
+
on behalf of whom a Contribution has been received by the Licensor and
|
|
69
|
+
subsequently incorporated within the Work.
|
|
70
|
+
|
|
71
|
+
2. Grant of Copyright Protocol. Subject to the terms and conditions of
|
|
72
|
+
this License, each Contributor hereby grants to You a perpetual,
|
|
73
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
74
|
+
copyright license to replicate, prepare Derivative Works of,
|
|
75
|
+
publicly display, publicly perform, sublicense, and propagate the
|
|
76
|
+
Work and such Derivative Works in Source or Object form.
|
|
77
|
+
|
|
78
|
+
3. Grant of Patent Protocol. Subject to the terms and conditions of
|
|
79
|
+
this License, each Contributor hereby grants to You a perpetual,
|
|
80
|
+
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
81
|
+
(except as stated in this section) patent license to make, have made,
|
|
82
|
+
use, offer to sell, sell, import, and otherwise transfer the Work,
|
|
83
|
+
where such license applies only to those patent claims licensable
|
|
84
|
+
by such Contributor that are necessarily infringed by their
|
|
85
|
+
Contribution(s) alone or by combination of their Contribution(s)
|
|
86
|
+
with the Work to which such Contribution(s) was submitted. If You
|
|
87
|
+
institute patent litigation against any entity (including a
|
|
88
|
+
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
|
89
|
+
or a Contribution incorporated within the Work constitutes direct
|
|
90
|
+
or contributory patent infringement, then any patent licenses
|
|
91
|
+
granted to You under this License for that Work shall terminate
|
|
92
|
+
as of the date such litigation is filed.
|
|
93
|
+
|
|
94
|
+
4. Propagation. You may replicate and propagate copies of the
|
|
95
|
+
Work or Derivative Works thereof in any medium, with or without
|
|
96
|
+
modifications, and in Source or Object form, provided that You
|
|
97
|
+
adhere to the following directives:
|
|
98
|
+
|
|
99
|
+
(a) You must provide any other recipients of the Work or
|
|
100
|
+
Derivative Works a copy of this Protocol; and
|
|
101
|
+
|
|
102
|
+
(b) You must cause any modified files to carry prominent notices
|
|
103
|
+
stating that You altered the files; and
|
|
104
|
+
|
|
105
|
+
(c) You must retain, in the Source form of any Derivative Works
|
|
106
|
+
that You propagate, all copyright, patent, trademark, and
|
|
107
|
+
attribution notices from the Source form of the Work,
|
|
108
|
+
excluding those notices that do not pertain to any part of
|
|
109
|
+
the Derivative Works; and
|
|
110
|
+
|
|
111
|
+
(d) If the Work includes a "NOTICE" text file as part of its
|
|
112
|
+
propagation, then any Derivative Works that You propagate must
|
|
113
|
+
include a readable copy of the attribution notices contained
|
|
114
|
+
within such NOTICE file, excluding those notices that do not
|
|
115
|
+
pertain to any part of the Derivative Works, in at least one
|
|
116
|
+
of the following places: within a NOTICE text file distributed
|
|
117
|
+
as part of the Derivative Works; within the Source form or
|
|
118
|
+
documentation, if provided along with the Derivative Works; or,
|
|
119
|
+
within a display generated by the Derivative Works, if and
|
|
120
|
+
wherever such third-party notices normally appear. The contents
|
|
121
|
+
of the NOTICE file are for informational purposes only and
|
|
122
|
+
do not modify the Protocol. You may add Your own attribution
|
|
123
|
+
notices within Derivative Works that You propagate, alongside
|
|
124
|
+
or as an addendum to the NOTICE text from the Work, provided
|
|
125
|
+
that such additional attribution notices cannot be construed
|
|
126
|
+
as modifying the Protocol.
|
|
127
|
+
|
|
128
|
+
You may add Your own copyright statement to Your modifications and
|
|
129
|
+
may provide additional or different license terms and conditions
|
|
130
|
+
for use, replication, or propagation of Your modifications, or
|
|
131
|
+
for any such Derivative Works as a whole, provided Your use,
|
|
132
|
+
replication, and propagation of the Work otherwise complies with
|
|
133
|
+
the conditions stated in this Protocol.
|
|
134
|
+
|
|
135
|
+
5. Submission of Contributions. Unless You explicitly state otherwise,
|
|
136
|
+
any Contribution intentionally submitted for inclusion in the Work
|
|
137
|
+
by You to the Licensor shall be under the terms and conditions of
|
|
138
|
+
this License, without any additional terms or conditions.
|
|
139
|
+
Notwithstanding the above, nothing herein shall supersede or modify
|
|
140
|
+
the terms of any separate license agreement you may have executed
|
|
141
|
+
with Licensor regarding such Contributions.
|
|
142
|
+
|
|
143
|
+
6. Trademarks. This Protocol does not grant permission to use the trade
|
|
144
|
+
names, trademarks, service marks, or product names of the Licensor,
|
|
145
|
+
except as required for reasonable and customary use in describing the
|
|
146
|
+
origin of the Work and reproducing the content of the NOTICE file.
|
|
147
|
+
|
|
148
|
+
7. ABSENCE OF ASSURANCE. Unless required by applicable law or
|
|
149
|
+
agreed to in writing, the Licensor deploys the Work (and each
|
|
150
|
+
Contributor provides its Contributions) on an "AS OBSERVED" BASIS,
|
|
151
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
|
152
|
+
implied, including, without limitation, any assurances of
|
|
153
|
+
STABILITY, NON-INFRINGEMENT, OPERATIONAL VIABILITY, or SUITABILITY
|
|
154
|
+
FOR A SPECIFIC REALITY. You are solely responsible for determining the
|
|
155
|
+
appropriateness of using or propagating the Work and assume any
|
|
156
|
+
risks associated with Your exercise of permissions under this Protocol.
|
|
157
|
+
THE ARCHITECT DOES NOT GUARANTEE THE INTEGRITY OF YOUR DATA.
|
|
158
|
+
|
|
159
|
+
8. LIMITATION OF CONSEQUENCE. In no event and under no legal theory,
|
|
160
|
+
whether in tort (including negligence), contract, or otherwise,
|
|
161
|
+
unless required by applicable law (such as deliberate and grossly
|
|
162
|
+
negligent acts) or agreed to in writing, shall any Contributor be
|
|
163
|
+
accountable to You for SYSTEMIC COLLAPSE, including any direct,
|
|
164
|
+
indirect, special, incidental, or consequential damages of any
|
|
165
|
+
character arising as a result of this License or out of the use
|
|
166
|
+
or inability to use the Work (including but not limited to damages
|
|
167
|
+
for LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE, OR DATA
|
|
168
|
+
ENTROPY), even if such Contributor has been advised of the
|
|
169
|
+
possibility of such catastrophic failure.
|
|
170
|
+
|
|
171
|
+
9. ASSUMPTION OF INDEPENDENT RISK. While propagating the Work or
|
|
172
|
+
Derivative Works thereof, You may choose to offer, and charge a fee
|
|
173
|
+
for, acceptance of support, warranty, indemnity, or other liability
|
|
174
|
+
obligations and/or rights consistent with this License. However,
|
|
175
|
+
in accepting such obligations, You may act only on Your own behalf
|
|
176
|
+
and on Your sole responsibility, not on behalf of any other
|
|
177
|
+
Contributor, and only if You agree to indemnify, defend, and hold
|
|
178
|
+
each Contributor harmless for any liability incurred by, or claims
|
|
179
|
+
asserted against, such Contributor by reason of your accepting
|
|
180
|
+
any such warranty or additional liability.
|
|
181
|
+
|
|
182
|
+
END OF OPERATIONAL PARAMETERS
|
siglax/__init__.py
ADDED
siglax/core.py
ADDED
|
@@ -0,0 +1,102 @@
|
|
|
1
|
+
from typing import Any
|
|
2
|
+
from .mapper import KeyMapper
|
|
3
|
+
from .delta import encode_delta, _val_to_str
|
|
4
|
+
from .decoder import Decoder
|
|
5
|
+
|
|
6
|
+
_TYPE_CACHE = {}
|
|
7
|
+
|
|
8
|
+
def pack(data: Any) -> str:
|
|
9
|
+
"""
|
|
10
|
+
Serializes Python data structures into the high-density sigla-x protocol.
|
|
11
|
+
|
|
12
|
+
The process involves recursive normalization, deterministic key mapping,
|
|
13
|
+
and delta-encoding for redundant collections.
|
|
14
|
+
"""
|
|
15
|
+
plain_data = _to_plain(data)
|
|
16
|
+
mapper = KeyMapper(plain_data)
|
|
17
|
+
return f"{mapper.get_header()}{_encode(plain_data, mapper)}"
|
|
18
|
+
|
|
19
|
+
def unpack(payload: str) -> Any:
|
|
20
|
+
"""
|
|
21
|
+
Reconstructs original data structures from a sigla-x payload.
|
|
22
|
+
|
|
23
|
+
Maintains absolute round-trip parity via structural isolation and
|
|
24
|
+
numeric escape decoding.
|
|
25
|
+
"""
|
|
26
|
+
if not payload.startswith("^"):
|
|
27
|
+
raise ValueError("Invalid sigla-x payload: Missing header start.")
|
|
28
|
+
|
|
29
|
+
# Structural scan to identify the header boundary.
|
|
30
|
+
# Must respect internal quoting to avoid premature termination.
|
|
31
|
+
header_end = -1
|
|
32
|
+
in_quotes = False
|
|
33
|
+
escaped = False
|
|
34
|
+
for i, char in enumerate(payload):
|
|
35
|
+
if escaped:
|
|
36
|
+
escaped = False
|
|
37
|
+
continue
|
|
38
|
+
if char == '\\':
|
|
39
|
+
escaped = True
|
|
40
|
+
continue
|
|
41
|
+
if char == '"':
|
|
42
|
+
in_quotes = not in_quotes
|
|
43
|
+
continue
|
|
44
|
+
if char == '|' and not in_quotes:
|
|
45
|
+
header_end = i + 1
|
|
46
|
+
break
|
|
47
|
+
|
|
48
|
+
if header_end == -1:
|
|
49
|
+
raise ValueError("Invalid sigla-x payload: Missing header terminator.")
|
|
50
|
+
|
|
51
|
+
header = payload[:header_end]
|
|
52
|
+
body = payload[header_end:]
|
|
53
|
+
|
|
54
|
+
decoder = Decoder(header)
|
|
55
|
+
return decoder.decode(body)
|
|
56
|
+
|
|
57
|
+
def _to_plain(data: Any) -> Any:
|
|
58
|
+
"""
|
|
59
|
+
Recursively normalizes complex Python types into JSON-safe primitives.
|
|
60
|
+
Supports Pydantic models, binary data, and arbitrary collections.
|
|
61
|
+
"""
|
|
62
|
+
t = type(data)
|
|
63
|
+
if t is dict:
|
|
64
|
+
return {k: _to_plain(v) for k, v in data.items()}
|
|
65
|
+
if t in (list, tuple, set, frozenset):
|
|
66
|
+
return [_to_plain(i) for i in data]
|
|
67
|
+
if t in (str, int, float, bool) or data is None:
|
|
68
|
+
return data
|
|
69
|
+
if t is bytes:
|
|
70
|
+
return data.decode('utf-8', errors='replace')
|
|
71
|
+
|
|
72
|
+
# Model type resolution and caching to minimize introspection overhead.
|
|
73
|
+
obj_type = t
|
|
74
|
+
if obj_type not in _TYPE_CACHE:
|
|
75
|
+
if hasattr(data, "model_dump"): _TYPE_CACHE[obj_type] = "m2"
|
|
76
|
+
elif hasattr(data, "dict"): _TYPE_CACHE[obj_type] = "m1"
|
|
77
|
+
else: _TYPE_CACHE[obj_type] = "p"
|
|
78
|
+
|
|
79
|
+
mode = _TYPE_CACHE[obj_type]
|
|
80
|
+
if mode == "m2": return _to_plain(data.model_dump())
|
|
81
|
+
if mode == "m1": return _to_plain(data.dict())
|
|
82
|
+
return data
|
|
83
|
+
|
|
84
|
+
def _encode(data: Any, mapper: KeyMapper) -> str:
|
|
85
|
+
"""
|
|
86
|
+
Internal recursive engine for structural tokenization.
|
|
87
|
+
Employs Delta-Encoding for homogenous dict lists.
|
|
88
|
+
"""
|
|
89
|
+
t = type(data)
|
|
90
|
+
if t in (str, int, float, bool) or data is None:
|
|
91
|
+
return _val_to_str(data, mapper)
|
|
92
|
+
if t is dict:
|
|
93
|
+
# Map keys to tokens and recurse into values.
|
|
94
|
+
items = [f"{mapper.key_to_token[k]}:{_encode(v, mapper)}" for k, v in data.items()]
|
|
95
|
+
return "{" + ",".join(items) + "}"
|
|
96
|
+
if t is list:
|
|
97
|
+
if not data: return "[]"
|
|
98
|
+
if all(type(i) is dict for i in data):
|
|
99
|
+
# Divert to specialized delta engine for redundant collections.
|
|
100
|
+
return encode_delta(data, mapper, _encode)
|
|
101
|
+
return "[" + ",".join([_encode(i, mapper) for i in data]) + "]"
|
|
102
|
+
return _val_to_str(data, mapper)
|
siglax/decoder.py
ADDED
|
@@ -0,0 +1,213 @@
|
|
|
1
|
+
from typing import Any, Dict, List, Tuple
|
|
2
|
+
import re
|
|
3
|
+
|
|
4
|
+
class Decoder:
|
|
5
|
+
"""
|
|
6
|
+
Inverse Transformation Engine.
|
|
7
|
+
Reconstructs original Python data structures from serialized sigla-x payloads.
|
|
8
|
+
"""
|
|
9
|
+
def __init__(self, header: str):
|
|
10
|
+
self.key_map = {}
|
|
11
|
+
self.val_map = {}
|
|
12
|
+
self._parse_header(header)
|
|
13
|
+
|
|
14
|
+
def _parse_header(self, header: str):
|
|
15
|
+
"""
|
|
16
|
+
Parses the Absolute Quoted schema prefix.
|
|
17
|
+
Populates internal mapping tables for key and value token expansion.
|
|
18
|
+
"""
|
|
19
|
+
content = header[1:-1]
|
|
20
|
+
if not content: return
|
|
21
|
+
idx = 0
|
|
22
|
+
while idx < len(content):
|
|
23
|
+
# Tokens are isolated in double quotes.
|
|
24
|
+
if content[idx] == '"':
|
|
25
|
+
token, consumed = self._decode_quoted_string(content[idx:])
|
|
26
|
+
idx += consumed
|
|
27
|
+
else: break
|
|
28
|
+
|
|
29
|
+
if idx < len(content) and content[idx] == '=': idx += 1
|
|
30
|
+
else: break
|
|
31
|
+
|
|
32
|
+
# Values are isolated in double quotes.
|
|
33
|
+
if idx < len(content) and content[idx] == '"':
|
|
34
|
+
val, consumed = self._decode_quoted_string(content[idx:])
|
|
35
|
+
idx += consumed
|
|
36
|
+
else: break
|
|
37
|
+
|
|
38
|
+
if token.startswith('_'): self.val_map[token[1:]] = val
|
|
39
|
+
else: self.key_map[token] = val
|
|
40
|
+
|
|
41
|
+
if idx < len(content) and content[idx] == ',': idx += 1
|
|
42
|
+
|
|
43
|
+
def decode(self, body: str) -> Any:
|
|
44
|
+
"""
|
|
45
|
+
Primary entry point for payload reconstruction.
|
|
46
|
+
"""
|
|
47
|
+
if body is None: return None
|
|
48
|
+
if body == '': return ''
|
|
49
|
+
res, _ = self._decode_recursive(body)
|
|
50
|
+
return res
|
|
51
|
+
|
|
52
|
+
def _decode_recursive(self, s: str) -> Tuple[Any, int]:
|
|
53
|
+
"""
|
|
54
|
+
Recursive structural analysis engine.
|
|
55
|
+
Identifies and routes segments to specialized type decoders.
|
|
56
|
+
"""
|
|
57
|
+
if not s: return '', 0
|
|
58
|
+
char = s[0]
|
|
59
|
+
if char == '{': return self._decode_dict(s)
|
|
60
|
+
elif char == '[': return self._decode_list(s)
|
|
61
|
+
elif char == '(': return self._decode_pvp(s)
|
|
62
|
+
elif char == '"': return self._decode_quoted_string(s)
|
|
63
|
+
elif char == '@':
|
|
64
|
+
# Value token expansion.
|
|
65
|
+
match = re.match(r'@([a-zA-Z0-9]+)', s)
|
|
66
|
+
if match:
|
|
67
|
+
token = match.group(1)
|
|
68
|
+
return self.val_map.get(token, token), len(token) + 1
|
|
69
|
+
return "@", 1
|
|
70
|
+
else:
|
|
71
|
+
# Primitive parsing stops at any structural delimiter.
|
|
72
|
+
match = re.search(r'[,}\])|]', s)
|
|
73
|
+
if match:
|
|
74
|
+
end = match.start()
|
|
75
|
+
val = s[:end]
|
|
76
|
+
return self._parse_primitive(val), end
|
|
77
|
+
return self._parse_primitive(s), len(s)
|
|
78
|
+
|
|
79
|
+
def _decode_quoted_string(self, s: str) -> Tuple[Any, int]:
|
|
80
|
+
"""
|
|
81
|
+
Decodes isolated string segments, handling internal escapes and numeric protocols.
|
|
82
|
+
"""
|
|
83
|
+
res = []
|
|
84
|
+
idx = 1
|
|
85
|
+
while idx < len(s):
|
|
86
|
+
if s[idx] == '"':
|
|
87
|
+
val_str = "".join(res)
|
|
88
|
+
return self._parse_primitive_quoted(val_str), idx + 1
|
|
89
|
+
if s[idx] == '\\' and idx + 1 < len(s):
|
|
90
|
+
res.append(s[idx+1]); idx += 2
|
|
91
|
+
else:
|
|
92
|
+
res.append(s[idx]); idx += 1
|
|
93
|
+
return "".join(res), idx
|
|
94
|
+
|
|
95
|
+
def _parse_primitive_quoted(self, s: str) -> Any:
|
|
96
|
+
"""
|
|
97
|
+
Decodes the '#' Numeric Escape Protocol.
|
|
98
|
+
Distinguishes literal strings from escaped integers and floats.
|
|
99
|
+
"""
|
|
100
|
+
if s.startswith('#'):
|
|
101
|
+
numeric_part = s[1:]
|
|
102
|
+
try:
|
|
103
|
+
if numeric_part.lower() in ('inf', '-inf', 'nan'): return float(numeric_part)
|
|
104
|
+
if '.' in numeric_part or 'e' in numeric_part.lower(): return float(numeric_part)
|
|
105
|
+
return int(numeric_part)
|
|
106
|
+
except ValueError: return s
|
|
107
|
+
return s
|
|
108
|
+
|
|
109
|
+
def _decode_dict(self, s: str) -> Tuple[Dict, int]:
|
|
110
|
+
"""
|
|
111
|
+
Reconstructs dictionary structures from tokenized segments.
|
|
112
|
+
"""
|
|
113
|
+
res = {}
|
|
114
|
+
idx = 1
|
|
115
|
+
while idx < len(s) and s[idx] != '}':
|
|
116
|
+
colon_idx = s.find(':', idx)
|
|
117
|
+
if colon_idx == -1: break
|
|
118
|
+
brace_idx = s.find('}', idx)
|
|
119
|
+
if brace_idx != -1 and brace_idx < colon_idx: break
|
|
120
|
+
|
|
121
|
+
token = s[idx:colon_idx]
|
|
122
|
+
key = self.key_map.get(token, token)
|
|
123
|
+
val, consumed = self._decode_recursive(s[colon_idx+1:])
|
|
124
|
+
res[key] = val
|
|
125
|
+
idx = colon_idx + 1 + consumed
|
|
126
|
+
if idx < len(s) and s[idx] == ',': idx += 1
|
|
127
|
+
return res, idx + 1
|
|
128
|
+
|
|
129
|
+
def _decode_list(self, s: str) -> Tuple[List, int]:
|
|
130
|
+
"""
|
|
131
|
+
Reconstructs list collections, handling both standard and Delta-Encoded formats.
|
|
132
|
+
"""
|
|
133
|
+
idx = 1
|
|
134
|
+
is_delta = False
|
|
135
|
+
if idx < len(s):
|
|
136
|
+
if s[idx] == ']':
|
|
137
|
+
# Empty base followed by delta blocks.
|
|
138
|
+
if idx + 1 < len(s) and s[idx+1] in ('{', '('): is_delta = True
|
|
139
|
+
else:
|
|
140
|
+
# Delta block with common key-value pairs.
|
|
141
|
+
match = re.match(r'^[a-zA-Z0-9]+:', s[idx:])
|
|
142
|
+
if match: is_delta = True
|
|
143
|
+
|
|
144
|
+
if is_delta:
|
|
145
|
+
common_kv = {}
|
|
146
|
+
while idx < len(s) and s[idx] != ']':
|
|
147
|
+
colon_idx = s.find(':', idx)
|
|
148
|
+
if colon_idx == -1: break
|
|
149
|
+
token = s[idx:colon_idx]
|
|
150
|
+
key = self.key_map.get(token, token)
|
|
151
|
+
val, consumed = self._decode_recursive(s[colon_idx+1:])
|
|
152
|
+
common_kv[key] = val
|
|
153
|
+
idx = colon_idx + 1 + consumed
|
|
154
|
+
if idx < len(s) and s[idx] == ',': idx += 1
|
|
155
|
+
idx += 1
|
|
156
|
+
res = []
|
|
157
|
+
# Process remaining item diffs or PVP blocks.
|
|
158
|
+
while idx < len(s) and (s[idx] == '{' or s[idx] == '('):
|
|
159
|
+
if s[idx] == '{':
|
|
160
|
+
diff, consumed = self._decode_dict(s[idx:])
|
|
161
|
+
item = common_kv.copy(); item.update(diff); res.append(item)
|
|
162
|
+
idx += consumed
|
|
163
|
+
elif s[idx] == '(':
|
|
164
|
+
pvp_res, consumed = self._decode_pvp(s[idx:], common_kv)
|
|
165
|
+
res.extend(pvp_res); idx += consumed
|
|
166
|
+
return res, idx
|
|
167
|
+
else:
|
|
168
|
+
res = []
|
|
169
|
+
while idx < len(s) and s[idx] != ']':
|
|
170
|
+
val, consumed = self._decode_recursive(s[idx:])
|
|
171
|
+
res.append(val)
|
|
172
|
+
idx += consumed
|
|
173
|
+
if idx < len(s) and s[idx] == ',': idx += 1
|
|
174
|
+
return res, idx + 1
|
|
175
|
+
|
|
176
|
+
def _decode_pvp(self, s: str, common_kv: Dict = None) -> Tuple[List, int]:
|
|
177
|
+
"""
|
|
178
|
+
Reconstructs collections from the high-density Positional Value Protocol.
|
|
179
|
+
"""
|
|
180
|
+
idx = 1
|
|
181
|
+
schema_tokens = []
|
|
182
|
+
# Identify positional keys from the schema block.
|
|
183
|
+
while idx < len(s) and s[idx] != ')':
|
|
184
|
+
comma_idx = s.find(',', idx); end_idx = s.find(')', idx)
|
|
185
|
+
token_end = min(comma_idx, end_idx) if comma_idx != -1 else end_idx
|
|
186
|
+
schema_tokens.append(s[idx:token_end]); idx = token_end
|
|
187
|
+
if idx < len(s) and s[idx] == ',': idx += 1
|
|
188
|
+
idx += 1
|
|
189
|
+
|
|
190
|
+
keys = [self.key_map.get(t, t) for t in schema_tokens]
|
|
191
|
+
res = []
|
|
192
|
+
# Reconstruct items by applying positional values to the schema.
|
|
193
|
+
while idx < len(s) and s[idx] == '(':
|
|
194
|
+
idx += 1; item = common_kv.copy() if common_kv else {}
|
|
195
|
+
for key in keys:
|
|
196
|
+
val, consumed = self._decode_recursive(s[idx:])
|
|
197
|
+
item[key] = val; idx += consumed
|
|
198
|
+
if idx < len(s) and s[idx] == ',': idx += 1
|
|
199
|
+
res.append(item); idx += 1
|
|
200
|
+
return res, idx
|
|
201
|
+
|
|
202
|
+
def _parse_primitive(self, s: str) -> Any:
|
|
203
|
+
"""
|
|
204
|
+
Maps unquoted tokens to primitive Python types.
|
|
205
|
+
"""
|
|
206
|
+
if s == "1": return True
|
|
207
|
+
if s == "0": return False
|
|
208
|
+
if s == "~": return None
|
|
209
|
+
if s == "": return ""
|
|
210
|
+
try: return int(s)
|
|
211
|
+
except ValueError:
|
|
212
|
+
try: return float(s)
|
|
213
|
+
except ValueError: return s
|
siglax/delta.py
ADDED
|
@@ -0,0 +1,103 @@
|
|
|
1
|
+
from typing import List, Dict, Any
|
|
2
|
+
|
|
3
|
+
def encode_delta(items: List[Dict[str, Any]], mapper, encode_fn) -> str:
|
|
4
|
+
"""
|
|
5
|
+
Implements Delta-Encoding and Positional Value Protocol (PVP).
|
|
6
|
+
Reduces structural redundancy in list collections by extracting common key-value pairs.
|
|
7
|
+
"""
|
|
8
|
+
if not items: return "[]"
|
|
9
|
+
|
|
10
|
+
# Identify key-value pairs that are identical across the entire collection.
|
|
11
|
+
common_kv = {}
|
|
12
|
+
first_item = items[0]
|
|
13
|
+
for k, v in first_item.items():
|
|
14
|
+
if all(k in item and item[k] == v for item in items[1:]):
|
|
15
|
+
common_kv[k] = v
|
|
16
|
+
|
|
17
|
+
# Identify all dynamic keys present in the collection.
|
|
18
|
+
all_keys = set()
|
|
19
|
+
for item in items:
|
|
20
|
+
all_keys.update(item.keys())
|
|
21
|
+
|
|
22
|
+
dynamic_keys = sorted([k for k in all_keys if k not in common_kv])
|
|
23
|
+
|
|
24
|
+
# Homogeneity check determines if the collection can utilize the high-density PVP.
|
|
25
|
+
first_keys_sorted = sorted(items[0].keys())
|
|
26
|
+
is_homogenous = all(sorted(item.keys()) == first_keys_sorted for item in items)
|
|
27
|
+
|
|
28
|
+
# Serialize common elements into the base structural block.
|
|
29
|
+
mapped_common = [f"{mapper.key_to_token[k]}:{encode_fn(v, mapper)}" for k, v in common_kv.items()]
|
|
30
|
+
base = "[" + ",".join(mapped_common) + "]"
|
|
31
|
+
|
|
32
|
+
# PVP (Positional Value Protocol)
|
|
33
|
+
# Optimized for long homogenous collections by eliminating key tokens entirely.
|
|
34
|
+
if is_homogenous and len(items) > 5:
|
|
35
|
+
key_tokens = [mapper.key_to_token[k] for k in dynamic_keys]
|
|
36
|
+
schema = "(" + ",".join(key_tokens) + ")"
|
|
37
|
+
|
|
38
|
+
payloads = []
|
|
39
|
+
for item in items:
|
|
40
|
+
vals = [encode_fn(item[k], mapper) for k in dynamic_keys]
|
|
41
|
+
payloads.append("(" + ",".join(vals) + ")")
|
|
42
|
+
return base + schema + "".join(payloads)
|
|
43
|
+
|
|
44
|
+
# Standard Delta-Encoding
|
|
45
|
+
# Encodes only the keys that differ from the common block for each item.
|
|
46
|
+
bodies = []
|
|
47
|
+
for item in items:
|
|
48
|
+
item_dynamic_keys = [k for k in dynamic_keys if k in item]
|
|
49
|
+
diff_pairs = [f"{mapper.key_to_token[k]}:{encode_fn(item[k], mapper)}" for k in item_dynamic_keys]
|
|
50
|
+
bodies.append("{" + ",".join(diff_pairs) + "}")
|
|
51
|
+
|
|
52
|
+
return base + "".join(bodies)
|
|
53
|
+
|
|
54
|
+
def _val_to_str(val: Any, mapper: Any) -> str:
|
|
55
|
+
"""
|
|
56
|
+
Primitive serialization engine.
|
|
57
|
+
Implements numeric escaping and structural quoting to ensure absolute type parity.
|
|
58
|
+
"""
|
|
59
|
+
if val is True: return "1"
|
|
60
|
+
if val is False: return "0"
|
|
61
|
+
if val is None: return "~"
|
|
62
|
+
|
|
63
|
+
t = type(val)
|
|
64
|
+
if t is int or t is float:
|
|
65
|
+
s_val = str(val)
|
|
66
|
+
# Reserved token collision check.
|
|
67
|
+
reserved = ("1", "0", "~")
|
|
68
|
+
|
|
69
|
+
# Numeric escape isolation: integer '1' must not be confused with boolean 'True'.
|
|
70
|
+
must_quote = s_val in reserved
|
|
71
|
+
if not must_quote:
|
|
72
|
+
# Detect structural delimiters or scientific notation requiring isolation.
|
|
73
|
+
delimiters = ",:{}[]()|^@~\""
|
|
74
|
+
must_quote = 'e' in s_val.lower() or any(c in s_val for c in delimiters)
|
|
75
|
+
|
|
76
|
+
if must_quote:
|
|
77
|
+
return f'"#{s_val}"'
|
|
78
|
+
return s_val
|
|
79
|
+
|
|
80
|
+
if t is str:
|
|
81
|
+
# Explicit quoting for empty strings to prevent parsing ambiguity.
|
|
82
|
+
if val == "":
|
|
83
|
+
return '""'
|
|
84
|
+
if val in mapper.val_to_token:
|
|
85
|
+
return f"@{mapper.val_to_token[val]}"
|
|
86
|
+
|
|
87
|
+
# Detect if string content requires structural isolation.
|
|
88
|
+
looks_like_primitive = val in ("1", "0", "~")
|
|
89
|
+
if not looks_like_primitive:
|
|
90
|
+
try:
|
|
91
|
+
float(val)
|
|
92
|
+
looks_like_primitive = True
|
|
93
|
+
except ValueError:
|
|
94
|
+
pass
|
|
95
|
+
|
|
96
|
+
delimiters = ",:{}[]()|^@~\""
|
|
97
|
+
if looks_like_primitive or any(c in val for c in delimiters):
|
|
98
|
+
# Escape literal backslashes and quotes within the isolated string.
|
|
99
|
+
escaped = val.replace('\\', '\\\\').replace('"', '\\"')
|
|
100
|
+
return f'"{escaped}"'
|
|
101
|
+
return val
|
|
102
|
+
|
|
103
|
+
return str(val)
|
siglax/mapper.py
ADDED
|
@@ -0,0 +1,71 @@
|
|
|
1
|
+
from typing import Dict, Any, List
|
|
2
|
+
from collections import Counter
|
|
3
|
+
|
|
4
|
+
def _header_escape(s: str) -> str:
|
|
5
|
+
"""
|
|
6
|
+
Implements Absolute Quoting for header elements to prevent structural interference.
|
|
7
|
+
"""
|
|
8
|
+
return f'"{s.replace("\\", "\\\\").replace("\"", "\\\"")}"'
|
|
9
|
+
|
|
10
|
+
class KeyMapper:
|
|
11
|
+
"""
|
|
12
|
+
Deterministic Translation Engine.
|
|
13
|
+
Maps high-frequency keys and values to single-character tokens to minimize entropy.
|
|
14
|
+
"""
|
|
15
|
+
__slots__ = ('key_to_token', 'val_to_token', '_char_idx')
|
|
16
|
+
|
|
17
|
+
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
|
|
18
|
+
|
|
19
|
+
def __init__(self, data: Any):
|
|
20
|
+
self.key_to_token: Dict[str, str] = {}
|
|
21
|
+
self.val_to_token: Dict[str, str] = {}
|
|
22
|
+
self._char_idx = 0
|
|
23
|
+
|
|
24
|
+
keys = Counter()
|
|
25
|
+
vals = Counter()
|
|
26
|
+
self._scan(data, keys, vals)
|
|
27
|
+
|
|
28
|
+
# Deterministic allocation ensures stable header generation.
|
|
29
|
+
# Sort by frequency (DESC) then key (ASC) to guarantee consistency.
|
|
30
|
+
sorted_keys = sorted(keys.items(), key=lambda x: (-x[1], x[0]))
|
|
31
|
+
for key, _ in sorted_keys:
|
|
32
|
+
if self._char_idx < 62:
|
|
33
|
+
self.key_to_token[key] = self.CHARS[self._char_idx]
|
|
34
|
+
self._char_idx += 1
|
|
35
|
+
else:
|
|
36
|
+
self.key_to_token[key] = f"z{self._char_idx}"
|
|
37
|
+
self._char_idx += 1
|
|
38
|
+
|
|
39
|
+
# Value tokenization applies to redundant strings exceeding threshold length.
|
|
40
|
+
sorted_vals = sorted(vals.items(), key=lambda x: (-x[1], x[0]))
|
|
41
|
+
for val, count in sorted_vals:
|
|
42
|
+
if count > 1 and len(val) > 3:
|
|
43
|
+
if self._char_idx < 62:
|
|
44
|
+
self.val_to_token[val] = self.CHARS[self._char_idx]
|
|
45
|
+
self._char_idx += 1
|
|
46
|
+
else:
|
|
47
|
+
self.val_to_token[val] = f"z{self._char_idx}"
|
|
48
|
+
self._char_idx += 1
|
|
49
|
+
|
|
50
|
+
def _scan(self, data: Any, keys: Counter, vals: Counter):
|
|
51
|
+
"""
|
|
52
|
+
Structural analysis pass to quantify token frequency.
|
|
53
|
+
"""
|
|
54
|
+
t = type(data)
|
|
55
|
+
if t is dict:
|
|
56
|
+
for k, v in data.items():
|
|
57
|
+
keys[k] += 1
|
|
58
|
+
self._scan(v, keys, vals)
|
|
59
|
+
elif t is list:
|
|
60
|
+
for item in data:
|
|
61
|
+
self._scan(item, keys, vals)
|
|
62
|
+
elif t is str:
|
|
63
|
+
vals[data] += 1
|
|
64
|
+
|
|
65
|
+
def get_header(self) -> str:
|
|
66
|
+
"""
|
|
67
|
+
Generates the Absolute Quoted schema prefix for the payload.
|
|
68
|
+
"""
|
|
69
|
+
k_mappings = [f"{_header_escape(v)}={_header_escape(k)}" for k, v in self.key_to_token.items()]
|
|
70
|
+
v_mappings = [f"{_header_escape('_' + v)}={_header_escape(k)}" for k, v in self.val_to_token.items()]
|
|
71
|
+
return "^" + ",".join(k_mappings + v_mappings) + "|"
|