lopace 0.1.0__tar.gz

lopace-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Aman Ulla
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,6 @@
1
+ include README.md
2
+ include LICENSE
3
+ include requirements.txt
4
+ recursive-include lopace *.py
5
+ global-exclude __pycache__
6
+ global-exclude *.py[co]
lopace-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,402 @@
1
+ Metadata-Version: 2.4
2
+ Name: lopace
3
+ Version: 0.1.0
4
+ Summary: Lossless Optimized Prompt Accurate Compression Engine
5
+ Home-page: https://github.com/connectaman/LoPace
6
+ Author: Aman Ulla
7
+ License: MIT
8
+ Project-URL: Homepage, https://github.com/amanulla/lopace
9
+ Project-URL: Repository, https://github.com/amanulla/lopace
10
+ Project-URL: Issues, https://github.com/amanulla/lopace/issues
11
+ Keywords: prompt,compression,tokenization,zstd,bpe,nlp
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
+ Classifier: Topic :: Text Processing :: Linguistic
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Classifier: Programming Language :: Python :: 3.8
19
+ Classifier: Programming Language :: Python :: 3.9
20
+ Classifier: Programming Language :: Python :: 3.10
21
+ Classifier: Programming Language :: Python :: 3.11
22
+ Classifier: Programming Language :: Python :: 3.12
23
+ Requires-Python: >=3.8
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: zstandard>=0.22.0
27
+ Requires-Dist: tiktoken>=0.5.0
28
+ Dynamic: home-page
29
+ Dynamic: license-file
30
+ Dynamic: requires-python
31
+
32
+ # LoPace
33
+
34
+ **Lossless Optimized Prompt Accurate Compression Engine**
35
+
36
+ A professional, open-source Python package for compressing and decompressing prompts using multiple techniques: Zstd, Token-based (BPE), and Hybrid methods. Achieve up to 80% space reduction while maintaining perfect lossless reconstruction.
37
+
38
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
39
+ [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
40
+
41
+ ## Features
42
+
43
+ - 🚀 **Three Compression Methods**:
44
+ - **Zstd**: Dictionary-based compression using the Zstandard algorithm
45
+ - **Token**: Byte-Pair Encoding (BPE) tokenization with binary packing
46
+ - **Hybrid**: Combination of tokenization and Zstd (best compression ratio)
47
+
48
+ - ✅ **Lossless**: Perfect reconstruction of original prompts
49
+ - 📊 **Compression Statistics**: Analyze compression ratios and space savings
50
+ - 🔧 **Simple API**: Easy-to-use interface for all compression methods
51
+ - 🎯 **Database-Ready**: Optimized for storing prompts in databases
52
+
53
+ ## Installation
54
+
55
+ ```bash
56
+ pip install lopace
57
+ ```
58
+
59
+ ### Dependencies
60
+
61
+ - `zstandard>=0.22.0` - For Zstd compression
62
+ - `tiktoken>=0.5.0` - For BPE tokenization
63
+
64
+ ## Quick Start
65
+
66
+ ```python
67
+ from lopace import PromptCompressor, CompressionMethod
68
+
69
+ # Initialize compressor
70
+ compressor = PromptCompressor(model="cl100k_base", zstd_level=15)
71
+
72
+ # Your prompt
73
+ prompt = "You are a helpful AI assistant..."
74
+
75
+ # Compress using hybrid method (recommended)
76
+ compressed = compressor.compress(prompt, CompressionMethod.HYBRID)
77
+
78
+ # Decompress back to original
79
+ original = compressor.decompress(compressed, CompressionMethod.HYBRID)
80
+
81
+ # Verify losslessness
82
+ assert original == prompt # ✓ True
83
+ ```
84
+
85
+ ## Usage Examples
86
+
87
+ ### Basic Compression/Decompression
88
+
89
+ ```python
90
+ from lopace import PromptCompressor, CompressionMethod
91
+
92
+ compressor = PromptCompressor()
93
+
94
+ # Compress and return both original and compressed
95
+ original, compressed = compressor.compress_and_return_both(
96
+ "Your prompt here",
97
+ CompressionMethod.HYBRID
98
+ )
99
+
100
+ # Decompress
101
+ recovered = compressor.decompress(compressed, CompressionMethod.HYBRID)
102
+ ```
103
+
104
+ ### Using Different Methods
105
+
106
+ ```python
107
+ compressor = PromptCompressor()
108
+
109
+ prompt = "Your system prompt here..."
110
+
111
+ # Method 1: Zstd only
112
+ zstd_compressed = compressor.compress_zstd(prompt)
113
+ zstd_decompressed = compressor.decompress_zstd(zstd_compressed)
114
+
115
+ # Method 2: Token-based (BPE)
116
+ token_compressed = compressor.compress_token(prompt)
117
+ token_decompressed = compressor.decompress_token(token_compressed)
118
+
119
+ # Method 3: Hybrid (recommended - best compression)
120
+ hybrid_compressed = compressor.compress_hybrid(prompt)
121
+ hybrid_decompressed = compressor.decompress_hybrid(hybrid_compressed)
122
+ ```
123
+
124
+ ### Get Compression Statistics
125
+
126
+ ```python
127
+ compressor = PromptCompressor()
128
+ prompt = "Your long system prompt..."
129
+
130
+ # Get stats for all methods
131
+ stats = compressor.get_compression_stats(prompt)
132
+
133
+ print(f"Original Size: {stats['original_size_bytes']} bytes")
134
+ print(f"Original Tokens: {stats['original_size_tokens']}")
135
+
136
+ for method, method_stats in stats['methods'].items():
137
+     print(f"\n{method}:")
138
+     print(f" Compressed: {method_stats['compressed_size_bytes']} bytes")
139
+     print(f" Space Saved: {method_stats['space_saved_percent']:.2f}%")
140
+ ```
141
+
142
+ ## Compression Methods Explained
143
+
144
+ ### 1. Zstd Compression
145
+
146
+ Uses Zstandard's dictionary-based algorithm to find repeated patterns and replace them with shorter references.
147
+
148
+ **Best for**: General text compression, when you want to avoid tokenization overhead.
149
+
150
+ ```python
151
+ compressed = compressor.compress_zstd(prompt)
152
+ original = compressor.decompress_zstd(compressed)
153
+ ```
154
+
155
+ ### 2. Token-Based Compression
156
+
157
+ Uses Byte-Pair Encoding (BPE) to convert text to token IDs, then packs them as binary data.
158
+
159
+ **Best for**: When you need token IDs anyway, or when working with LLM tokenizers.
160
+
161
+ ```python
162
+ compressed = compressor.compress_token(prompt)
163
+ original = compressor.decompress_token(compressed)
164
+ ```
165
+
166
+ ### 3. Hybrid Compression (Recommended)
167
+
168
+ Combines tokenization and Zstd compression for maximum efficiency:
169
+
170
+ 1. Tokenizes text to reduce redundancy
171
+ 2. Packs token IDs as binary (2 bytes per token when every ID fits in 16 bits)
172
+ 3. Applies Zstd compression on the binary data
173
+
174
+ **Best for**: Database storage where maximum compression is needed.
175
+
176
+ ```python
177
+ compressed = compressor.compress_hybrid(prompt)
178
+ original = compressor.decompress_hybrid(compressed)
179
+ ```
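Steps 2 and 3 of the hybrid method can be sketched with the standard library alone. This is an illustration, not LoPace's implementation: `zlib` stands in for Zstandard, the token IDs are supplied by the caller rather than produced by tiktoken, and the one-byte width flag and all function names are assumptions made for this example. The flag is there because an encoding such as `cl100k_base` has about 100,000 token IDs, so a fixed 2-byte width cannot represent every ID.

```python
import struct
import zlib  # stand-in for Zstandard so the sketch needs only the standard library


def pack_token_ids(token_ids):
    """Pack token IDs as little-endian unsigned ints behind a 1-byte width flag."""
    # Hypothetical layout: flag 2 means uint16 ("H"), flag 4 means uint32 ("I").
    width = 2 if max(token_ids, default=0) <= 0xFFFF else 4
    code = "H" if width == 2 else "I"
    return bytes([width]) + struct.pack(f"<{len(token_ids)}{code}", *token_ids)


def unpack_token_ids(blob):
    """Invert pack_token_ids by reading the width flag first."""
    width = blob[0]
    code = "H" if width == 2 else "I"
    count = (len(blob) - 1) // width
    return list(struct.unpack(f"<{count}{code}", blob[1:]))


def hybrid_compress(token_ids, level=9):
    """Pack the IDs, then run a general-purpose compressor over the bytes."""
    return zlib.compress(pack_token_ids(token_ids), level)


def hybrid_decompress(blob):
    """Decompress the bytes, then unpack them back into token IDs."""
    return unpack_token_ids(zlib.decompress(blob))
```

Choosing the width per prompt keeps short-vocabulary prompts at 2 bytes per token while still round-tripping IDs above 65,535.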
180
+
181
+ ## API Reference
182
+
183
+ ### `PromptCompressor`
184
+
185
+ Main compressor class.
186
+
187
+ #### Constructor
188
+
189
+ ```python
190
+ PromptCompressor(
191
+ model: str = "cl100k_base",
192
+ zstd_level: int = 15
193
+ )
194
+ ```
195
+
196
+ **Parameters:**
197
+ - `model`: Tokenizer model name (default: `"cl100k_base"`)
198
+ - Options: `"cl100k_base"`, `"p50k_base"`, `"r50k_base"`, `"gpt2"`, etc.
199
+ - `zstd_level`: Zstd compression level 1-22 (default: `15`)
200
+ - Higher = better compression but slower
201
+
202
+ #### Methods
203
+
204
+ ##### `compress(text: str, method: CompressionMethod) -> bytes`
205
+
206
+ Compress a prompt using the specified method.
207
+
208
+ ##### `decompress(compressed_data: bytes, method: CompressionMethod) -> str`
209
+
210
+ Decompress a compressed prompt.
211
+
212
+ ##### `compress_and_return_both(text: str, method: CompressionMethod) -> Tuple[str, bytes]`
213
+
214
+ Compress and return both original and compressed versions.
215
+
216
+ ##### `get_compression_stats(text: str, method: Optional[CompressionMethod]) -> dict`
217
+
218
+ Get detailed compression statistics for analysis.
219
+
220
+ ### `CompressionMethod`
221
+
222
+ Enumeration of available compression methods:
223
+
224
+ - `CompressionMethod.ZSTD` - Zstandard compression
225
+ - `CompressionMethod.TOKEN` - Token-based compression
226
+ - `CompressionMethod.HYBRID` - Hybrid compression (recommended)
227
+
228
+ ## How It Works
229
+
230
+ ### Compression Pipeline (Hybrid Method)
231
+
232
+ ```
233
+ Input: Raw System Prompt String (100%)
234
+
235
+ Tokenization: Convert to Tiktoken IDs (~70% reduced)
236
+
237
+ Binary Packing: Convert IDs to uint16 (~50% of above)
238
+
239
+ Zstd: Final compression (~30% further reduction)
240
+
241
+ Output: Compressed Binary Blob
242
+ ```
243
+
244
+ ### Why Hybrid is Best for Databases
245
+
246
+ 1. **Searchability**: Once the Zstd layer is decompressed, token IDs can be searched without decoding back to text
247
+ 2. **Consistency**: Fixed tokenizer ensures stable compression ratios
248
+ 3. **Efficiency**: Maximum space savings for millions of prompts
249
+
250
+ ## Example Output
251
+
252
+ ```python
253
+ # Original prompt: 500 bytes
254
+ # After compression:
255
+ # Zstd: 180 bytes (64% space saved)
256
+ # Token: 240 bytes (52% space saved)
257
+ # Hybrid: 120 bytes (76% space saved) ← Best!
258
+ ```
259
+
260
+ ## Running the Example
261
+
262
+ ```bash
263
+ python example.py
264
+ ```
265
+
266
+ This will demonstrate all compression methods and show statistics.
267
+
268
+ ## Interactive Web App (Streamlit)
269
+
270
+ LoPace includes an interactive Streamlit web application with comprehensive evaluation metrics:
271
+
272
+ ### Features
273
+
274
+ - **Interactive Interface**: Enter prompts and see real-time compression results
275
+ - **Comprehensive Metrics**: All four industry-standard metrics:
276
+ - Compression Ratio (CR): $CR = \frac{S_{original}}{S_{compressed}}$
277
+ - Space Savings (SS): $SS = 1 - \frac{S_{compressed}}{S_{original}}$
278
+ - Bits Per Character (BPC): $BPC = \frac{\text{Total Bits}}{\text{Total Characters}}$
278
+ - Throughput (MB/s): $T = \frac{\text{Data Size}}{\text{Time}}$
280
+ - **Lossless Verification**:
281
+ - SHA-256 Hash Verification
282
+ - Exact Match (Character-by-Character)
283
+ - Reconstruction Error: $E = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(x_i \neq \hat{x}_i) = 0$
284
+ - **Side-by-Side Comparison**: Compare all three compression methods
285
+ - **Real-time Configuration**: Adjust tokenizer model and Zstd level
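The four metrics listed above follow directly from the sizes and the elapsed time of a single compression call. The sketch below is not taken from the Streamlit app: `compression_metrics` is a hypothetical helper, `zlib.compress` is used as a stand-in compressor, and sizes are measured on the UTF-8 encoding of the text.

```python
import time
import zlib  # stand-in compressor; LoPace itself uses Zstandard and tiktoken


def compression_metrics(text, compress=zlib.compress):
    """Compute CR, SS, BPC and throughput for one compression call."""
    if not text:
        raise ValueError("text must be non-empty")
    raw = text.encode("utf-8")
    start = time.perf_counter()
    compressed = compress(raw)
    elapsed = time.perf_counter() - start
    return {
        # CR = S_original / S_compressed
        "compression_ratio": len(raw) / len(compressed),
        # SS = 1 - S_compressed / S_original
        "space_savings": 1 - len(compressed) / len(raw),
        # BPC = compressed bits / characters in the original text
        "bits_per_character": 8 * len(compressed) / len(text),
        # T = megabytes of input processed per second
        "throughput_mb_s": (len(raw) / 1e6) / elapsed if elapsed > 0 else float("inf"),
    }
```

Any callable that maps `bytes` to `bytes` can be passed as `compress`, so the same helper could wrap a Zstandard call.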
286
+
287
+ ### Running the Streamlit App
288
+
289
+ ```bash
290
+ streamlit run streamlit_app.py
291
+ ```
292
+
293
+ The app will open in your default web browser at `http://localhost:8501`
294
+
295
+ ### Screenshot Preview
296
+
297
+ The app features:
298
+ - **Left Panel**: Text input area for entering prompts
299
+ - **Right Panel**: Results with tabs for each compression method
300
+ - **Metrics Dashboard**: Real-time calculation of all evaluation metrics
301
+ - **Verification Section**: Hash matching and exact match verification
302
+ - **Comparison Table**: Side-by-side comparison of all methods
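The verification checks described for the app (SHA-256 match, exact match, and reconstruction error $E$) can be sketched as follows. The function name and return shape are assumptions for illustration, and counting a length mismatch as extra errors is a choice made here, not something the README specifies.

```python
import hashlib


def _sha256_hex(text):
    """Hex digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_lossless(original, recovered):
    """Return hash match, exact match, and per-character reconstruction error."""
    n = max(len(original), len(recovered))
    # Positions that differ, plus every position present in only one string.
    mismatches = sum(a != b for a, b in zip(original, recovered))
    mismatches += abs(len(original) - len(recovered))
    return {
        "sha256_match": _sha256_hex(original) == _sha256_hex(recovered),
        "exact_match": original == recovered,
        "reconstruction_error": mismatches / n if n else 0.0,
    }
```

For a lossless round trip all three checks agree: both matches are true and the error is exactly zero.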
303
+
304
+ ## Development
305
+
306
+ ### Setup Development Environment
307
+
308
+ ```bash
309
+ git clone https://github.com/amanulla/lopace.git
310
+ cd lopace
311
+ pip install -r requirements-dev.txt
312
+ ```
313
+
314
+ ### Running Tests
315
+
316
+ ```bash
317
+ pytest
318
+ ```
319
+
320
+ ### CI/CD Pipeline
321
+
322
+ This project uses GitHub Actions for automated testing and publishing:
323
+
324
+ - **Tests run automatically** on every push and pull request
325
+ - **Publishing to PyPI** happens automatically when:
326
+ - All tests pass ✅
327
+ - Push is to `main`/`master` branch or a version tag (e.g., `v0.1.0`)
328
+
329
+ See [.github/workflows/README.md](.github/workflows/README.md) for detailed setup instructions.
330
+
331
+ ## Mathematical Background
332
+
333
+ ### Compression Techniques Used
334
+
335
+ LoPace uses the following compression techniques:
336
+
337
+ 1. **LZ77 (Sliding Window)**: Used **indirectly** through Zstandard
338
+ - Zstandard internally uses LZ77-style algorithms to find repeated patterns
339
+ - Instead of storing "assistant" again, it stores a tuple: (distance_back, length)
340
+ - We use this by calling `zstandard.compress()` - the LZ77 is handled internally
341
+
342
+ 2. **Huffman Coding / FSE (Finite State Entropy)**: Used **indirectly** through Zstandard
343
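+ - Zstandard entropy-codes its output with Huffman coding (for literals) and FSE, a table-based coder from the asymmetric numeral systems (ANS) family (for match sequences)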
344
+ - Assigns shorter binary codes to characters/patterns that appear most frequently
345
+ - Again, handled internally by the zstandard library
346
+
347
+ 3. **BPE Tokenization**: Used **directly** via tiktoken
348
+ - Byte-Pair Encoding converts text to token IDs
349
+ - Maps frequent byte sequences to single IDs, shortening the symbol sequence before compression
350
+ - Implemented by OpenAI's tiktoken library
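To make the `(distance_back, length)` idea concrete, here is a toy LZ77 encoder and decoder. It is a teaching sketch only and does not reflect Zstandard's actual match finder or format: it searches greedily in a small window and emits classic `(distance_back, length, next_byte)` triples, where the trailing literal byte is an addition of this example.

```python
def lz77_encode(data, window=255):
    """Greedy toy LZ77: emit (distance_back, length, next_byte) triples."""
    out, i = [], 0
    while i < len(data):
        best_dist, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one byte early so a literal next_byte always remains.
            while i + length < len(data) - 1 and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        out.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decode(triples):
    """Rebuild the bytes by copying from earlier output, then adding the literal."""
    buf = bytearray()
    for dist, length, nxt in triples:
        for _ in range(length):
            buf.append(buf[-dist])  # byte-by-byte copy allows overlapping matches
        buf.append(nxt)
    return bytes(buf)
```

On `b"assistant assistant assistant"` the second and third occurrences collapse into a single back-reference, which is the effect the README describes.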
351
+
352
+ ### Shannon Entropy
353
+
354
+ For a memoryless source coded one symbol at a time, the theoretical compression limit is given by Shannon entropy:
355
+
356
+ $H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$
357
+
358
+ Where:
359
+ - $H(X)$ is the entropy of the source
360
+ - $P(x_i)$ is the probability of character/pattern $x_i$
361
+
362
+ LoPace **calculates** Shannon Entropy to show theoretical compression limits:
363
+
364
+ ```python
365
+ compressor = PromptCompressor()
366
+ entropy = compressor.calculate_shannon_entropy("Your prompt")
367
+ limits = compressor.get_theoretical_compression_limit("Your prompt")
368
+ print(f"Theoretical minimum: {limits['theoretical_min_bytes']:.2f} bytes")
369
+ ```
370
+
371
+ This allows you to compare actual compression against the theoretical limit.
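The entropy formula above can be evaluated directly from character frequencies. The sketch below is an independent illustration, not the body of `calculate_shannon_entropy`; the function names are hypothetical and the estimate is zeroth-order, meaning it treats characters as independent.

```python
import math
from collections import Counter


def shannon_entropy_bits(text):
    """Zeroth-order entropy in bits per character, from character frequencies."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def entropy_bound_bytes(text):
    """Entropy times length, converted from bits to bytes."""
    return shannon_entropy_bits(text) * len(text) / 8
```

Because this estimate ignores repetition across characters, compressors that exploit repeated substrings, such as LZ77-style matchers, can produce output smaller than this figure on repetitive prompts.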
372
+
373
+ ## License
374
+
375
+ MIT License - see [LICENSE](LICENSE) file for details.
376
+
377
+ ## Contributing
378
+
379
+ Contributions are welcome! We appreciate your help in making LoPace better.
380
+
381
+ Please read our [Contributing Guidelines](CONTRIBUTING.md) and [Code of Conduct](CODE_OF_CONDUCT.md) before contributing.
382
+
383
+ ### Quick Start for Contributors
384
+
385
+ 1. Fork the repository
386
+ 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
387
+ 3. Make your changes
388
+ 4. Run tests (`pytest tests/ -v`)
389
+ 5. Commit your changes (`git commit -m 'Add amazing feature'`)
390
+ 6. Push to the branch (`git push origin feature/amazing-feature`)
391
+ 7. Open a Pull Request
392
+
393
+ For more details, see [CONTRIBUTING.md](CONTRIBUTING.md).
394
+
395
+ ## Author
396
+
397
+ Aman Ulla
398
+
399
+ ## Acknowledgments
400
+
401
+ - Built on top of [zstandard](https://github.com/facebook/zstd) and [tiktoken](https://github.com/openai/tiktoken)
402
+ - Inspired by the need for efficient prompt storage in LLM applications