lzwrb 0.1.0 → 0.2.2

Files changed (6)
  1. checksums.yaml +4 -4
  2. data/.yardopts +1 -0
  3. data/CHANGELOG.md +31 -0
  4. data/README.md +286 -0
  5. data/lib/lzwrb.rb +110 -40
  6. metadata +15 -7
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3ca54b9ac48840e65fc5d307707491fc3f06dae167a62fb14e1700877a169cc8
4
- data.tar.gz: ca28f83f51533b04d093f79cf16f4a23cb02a646d496e48ef7f98802128290f8
3
+ metadata.gz: 1227a41fbcf1834033772b23027c96f36a8ec6a8b4c3840aa67d0e21c407326e
4
+ data.tar.gz: 391e9884fa03d762764d9e118c2b7b977eb9b90c3eeea2460d60e42245b45616
5
5
  SHA512:
6
- metadata.gz: 0a146638c2720f8cc5b2e6d8af24f461ddad63d09eb4d86660897a73f7832cdf1439ef53aeb9e52754a536e5d36b5f4799b488864372195a0b5c47154db76b86
7
- data.tar.gz: 663e70084a893fa73b210ead73e91656f16753f00da72ac00e7f1474e861f65f98b2b81e048959daa8938c0a227c64338f14c463526589573b4ccaad163da079
6
+ metadata.gz: 02b472ee9e3ca47e1da3ef1c9aa3c33606dd84759121534c4e443299e8281b01b5a162aff9c00c5f4e129e360dbce693d90b66ced81452eaa22ebbcba167431b
7
+ data.tar.gz: 67dcb5a8d45113ca2c356b4502efcc9e493a0a67d5dafc03557c2020fa771cb92bfde7fb5cc8de3db79da7a6cae2adfcbf88a191307f82af58d0da4fb0914d57
data/.yardopts ADDED
@@ -0,0 +1 @@
1
+ lib/**/*.rb -m markdown --no-private - CHANGELOG.md docs/**/*
data/CHANGELOG.md ADDED
@@ -0,0 +1,31 @@
1
+ # Changelog
2
+
3
+ ### 0.2.2 (03/Mar/2024)
4
+
5
+ Minor encoding optimization (~10% speed increase)
6
+
7
+ ### 0.2.1 (03/Feb/2024)
8
+
9
+ Documentation fix.
10
+
11
+ ### 0.2.0 (03/Feb/2024)
12
+
13
+ - Adds: Binary vs textual encoding modes.
14
+ - Adds: Rake tests.
15
+ - Adds: YARD docs.
16
+ - Adds: Debug logging.
17
+ - Fixes: Encoding of Unicode strings in binary mode.
18
+ - Fixes: Safe guard.
19
+
20
+ ### 0.1.0 (02/Feb/2024)
21
+
22
+ Initial release. Already includes:
23
+
24
+ - Working encoding and decoding.
25
+ - Constant and variable code length support.
26
+ - Custom alphabet support.
27
+ - Optional Clear/Stop codes support, including deferred Clear codes.
28
+ - Only Least Significant Bit packing order.
29
+ - Full GIF specs support.
30
+ - Presets.
31
+
data/README.md ADDED
@@ -0,0 +1,286 @@
1
+ # lzwrb
2
+
3
+ [![Gem Version](https://img.shields.io/gem/v/lzwrb.svg)](https://rubygems.org/gems/lzwrb)
4
+ [![Gem](https://img.shields.io/gem/dt/lzwrb.svg)](https://rubygems.org/gems/lzwrb)
5
+ [![Documentation](https://img.shields.io/badge/docs-grey.svg)](https://www.rubydoc.info/gems/lzwrb/)
6
+
7
+ lzwrb is a Ruby gem for LZW encoding and decoding. Main features:
8
+
9
+ * Pure Ruby, no dependencies.
10
+ * Highly configurable (constant/variable code length, bit packing order...). See [Configuration](#configuration).
11
+ * Compatible with many LZW specifications, including GIF (see [Presets](#presets)).
12
+ * Reasonably fast for pure Ruby (see [Benchmarks](#benchmarks)).
13
+
14
+ See also: [Documentation](https://www.rubydoc.info/gems/lzwrb/), [Changelog](https://github.com/edelkas/lzwrb/blob/master/CHANGELOG.md).
15
+
16
+ ## Table of contents
17
+
18
+ - [What's LZW?](#whats-lzw)
19
+ - [Installation](#installation)
20
+ - [Basic Usage](#basic-usage)
21
+ - [Configuration](#configuration)
22
+ - [Presets](#presets)
23
+ - [Binary vs textual](#binary-vs-textual)
24
+ - [Code length](#code-length)
25
+ - [Packing order](#packing-order)
26
+ - [Clear & Stop codes](#clear--stop-codes)
27
+ - [Custom alphabets](#custom-alphabets)
28
+ - [Verbosity](#verbosity)
29
+ - [Benchmarks](#benchmarks)
30
+ - [Todo](#todo)
31
+ - [Notes](#notes)
32
+
33
+ ## What's LZW?
34
+
35
+ In short, LZW is a dictionary-based lossless compression algorithm. The encoder scans the data sequentially, building a table of patterns and substituting them in the data by their corresponding code. The decoder, configured with the same settings, will recognize the same patterns and generate the same table, thus being able to recover the original data from the sequence of codes.
36
+
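The table-building idea described above can be sketched in a few lines of Ruby. This is only an illustration under simplifying assumptions (byte alphabet, unbounded table, no bit packing, no Clear/Stop codes); the method names `lzw_encode_sketch` and `lzw_decode_sketch` are hypothetical and are not part of the gem.

```ruby
# Minimal LZW sketch: codes are returned as an Array of Integers.
# Assumptions: 256-entry byte alphabet, table never fills up, no special codes.
def lzw_encode_sketch(data)
  table = {}
  256.times { |i| table[[i].pack('C')] = i } # single-byte binary strings
  codes = []
  current = ''.b
  data.b.each_char do |ch|
    candidate = current + ch
    if table.key?(candidate)
      current = candidate            # keep extending the known pattern
    else
      codes << table[current]        # emit code of the longest known pattern
      table[candidate] = table.size  # learn the new pattern
      current = ch
    end
  end
  codes << table[current] unless current.empty?
  codes
end

def lzw_decode_sketch(codes)
  return ''.b if codes.empty?
  table = Array.new(256) { |i| [i].pack('C') }
  previous = table[codes.first]
  output = previous.dup
  codes.drop(1).each do |code|
    # A code equal to table.size refers to the entry about to be created.
    entry = code < table.size ? table[code] : previous + previous[0]
    output << entry
    table << previous + entry[0]     # rebuild the same table as the encoder
    previous = entry
  end
  output
end

codes = lzw_encode_sketch('TOBEORNOTTOBEORTOBEORNOT')
restored = lzw_decode_sketch(codes)
```

The decoder never receives the table; it reconstructs it from the code stream, which is why both sides must agree on the settings beforehand.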
37
+ The original algorithm used constant code lengths (typically 12 bits), but this is wasteful, especially for data that can be represented with significantly fewer bits, so variable code lengths were introduced. Another common feature is the usage of special _Clear_ and _Stop_ codes: when the table is full, the encoder and decoder have to reinitialize it, but this can also be done at will using the Clear code. Likewise, the Stop code indicates the end of the data, which may actually be necessary to avoid ambiguity (see [Clear & Stop codes](#clear--stop-codes)). See [Configuration](#configuration) for a more in-depth description of these and other features, and refer to the [Wikipedia article](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch) for a more thorough explanation.
38
+
39
+ Nowadays, LZW has largely been replaced by other algorithms that provide a better compression ratio, notably [Deflate](https://en.wikipedia.org/wiki/Deflate), implemented by the ubiquitous [zlib](https://en.wikipedia.org/wiki/Zlib) and formats like Zip or GZip, and used in a plethora of file formats, like PNG images. Deflate also provides a faster decoding speed, though encoding is usually slower. Nevertheless, LZW still has its place in more classic formats yet in use, like GIF and TIFF images, PDF files, or the classic UNIX compress utility.
40
+
41
+ ## Installation
42
+
43
+ The setup procedure is completely standard.
44
+
45
+ ### Using Bundler
46
+
47
+ With [Bundler](https://bundler.io/#getting-started) just add the following line to your Gemfile:
48
+
49
+ ```ruby
50
+ gem 'lzwrb'
51
+ ```
52
+
53
+ And then install via `bundle install`.
54
+
55
+ ### Using RubyGems
56
+
57
+ With Ruby's package manager, just execute the following command on the terminal:
58
+
59
+ ```
60
+ gem install lzwrb
61
+ ```
62
+
63
+ You may need admin privileges (e.g. `sudo` in Linux). Beware of using sudo if you're using a version manager, like [RVM](https://rvm.io/) or [rbenv](https://github.com/rbenv/rbenv).
64
+
65
+ ## Basic Usage
66
+
67
+ Require the gem, create an `LZWrb` object with the desired settings and run the `encode` and `decode` methods.
68
+
69
+ The following example uses the default configuration. See the next section for a list of all the available settings, which can be provided to the constructor via keyword arguments.
70
+
71
+ ```ruby
72
+ require 'lzwrb'
73
+ lzw = LZWrb.new
74
+ data = 'TOBEORNOTTOBEORTOBEORNOT' * 10000
75
+ puts lzw.decode(lzw.encode(data)) == data.b
76
+ ```
77
+
78
+ Output sample:
79
+ ```
80
+ [09:41:19.628] LZW <- Encoding 234.375KiB with 8-16 bit codes, LSB packing, no special codes, binary mode.
81
+ [09:41:19.767] LZW -> Encoding finished in 0.139s (avg. 13.172 mbit/s).
82
+ [09:41:19.767] LZW -> Encoded data: 4.458KiB (98.10% compression).
83
+ [09:41:19.767] LZW <- Decoding 4.458KiB with 8-16 bit codes, LSB packing, no special codes, binary mode.
84
+ [09:41:19.773] LZW -> Decoding finished in 0.006s (avg. 5.431 mbit/s).
85
+ [09:41:19.773] LZW -> Decoded data: 234.375KiB (98.10% compression).
86
+ true
87
+ ```
88
+
89
+ Note: The output can be reduced or suppressed (or detailed), see [Verbosity](#verbosity).
90
+
91
+ ## Configuration
92
+
93
+ Most of the usual options for the LZW algorithm can be configured individually by supplying the appropriate keyword arguments in the constructor. Additionally, there are several presets available, i.e., a selection of setting values with a specific purpose (e.g. best compression, fastest encoding, compatible with GIF...). Examples:
94
+
95
+ ```ruby
96
+ lzw1 = LZWrb.new(preset: LZWrb::PRESET_GIF)
97
+ lzw2 = LZWrb.new(min_bits: 8, max_bits: 16)
98
+ lzw3 = LZWrb.new(preset: LZWrb::PRESET_FAST, clear: true, stop: true)
99
+ ```
100
+
101
+ Individual settings may be used together with a preset, in which case the individual setting takes preference over the value that may be set by the preset, thus enabling the fine-tuning of a specific preset.
102
+
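The precedence rule above (individual settings win over the preset) amounts to a hash merge. The following is a sketch of that idea only, not the gem's actual code: `EXAMPLE_PRESET` and `resolve_settings` are hypothetical names introduced for illustration.

```ruby
# Hypothetical preset, shaped like the preset hashes in the gem's source.
EXAMPLE_PRESET = { min_bits: 8, max_bits: 12, clear: true, stop: true }.freeze

# Explicit (non-nil) keyword values override whatever the preset specifies.
# Settings left as nil fall back to the preset's value.
def resolve_settings(preset, **overrides)
  (preset || {}).merge(overrides.compact)
end

settings = resolve_settings(EXAMPLE_PRESET, max_bits: 16, stop: nil)
```

Here `max_bits` is taken from the explicit argument, while `stop` was left as `nil` and therefore keeps the preset's value.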
103
+ The following options are available:
104
+
105
+ Argument | Type | Default | Description
106
+ ------------ | ------- | --------- | ---
107
+ `:alphabet` | Array | `BINARY` | List of characters that compose the data to encode. See [Custom alphabets](#custom-alphabets).
108
+ `:binary` | Boolean | True | Whether to use binary or textual mode. See [Binary vs textual](#binary-vs-textual).
109
+ `:bits` | Integer | None | Code length in bits, for constant length mode. See [Code length](#code-length).
110
+ `:clear` | Boolean | False | Whether to use Clear codes or not. See [Clear & Stop codes](#clear--stop-codes).
111
+ `:deferred` | Boolean | False | Whether to use deferred Clear codes for the decoding process. See [Clear & Stop codes](#clear--stop-codes).
112
+ `:lsb` | Boolean | True | Whether to use LSB or MSB (least/most significant bit packing order). See [Packing order](#packing-order).
113
+ `:max_bits` | Integer | 16 | Maximum code length in bits, for variable length mode. See [Code length](#code-length).
114
+ `:min_bits` | Integer | 8 | Minimum code length in bits, for variable length mode. See [Code length](#code-length).
115
+ `:preset` | Hash | None | Specifies which preset (set of options) to use. See [Presets](#presets).
116
+ `:safe` | Boolean | False | Perform first pass during encoding process for data integrity verification. See [Custom alphabets](#custom-alphabets).
117
+ `:stop` | Boolean | False | Whether to use Stop codes or not. See [Clear & Stop codes](#clear--stop-codes).
118
+ `:verbosity` | Symbol | `:normal` | Specifies the amount of detail to print to the console. See [Verbosity](#verbosity).
119
+
120
+ ### Presets
121
+
122
+ Presets specify the value for several configuration options at the same time, with a specific goal in mind (typically, to optimize a certain aspect, or to be compatible with certain format). They can be used together with individual settings, in which case the latter take preference.
123
+
124
+ The following presets are currently available:
125
+
126
+ Preset | Description
127
+ --- | ---
128
+ `PRESET_BEST` | Aimed at best compression, uses variable code length (8-16 bits), no special codes and LSB packing.
129
+ `PRESET_FAST` | Aimed at fastest encoding, uses a constant code length of 16 bits, no special codes and LSB packing.
130
+ `PRESET_GIF` | This is the exact specification implemented by the GIF format, uses variable code length (8-12 bits), clear and stop codes, and LSB packing.
131
+
132
+ Their descriptive names are based on the average case. However, on unusual samples `PRESET_BEST` may actually compress worse than many other settings.
133
+
134
+ For example, on highly random samples, most patterns are very short, perhaps only 1 or 2 characters long, and thus substituting them with a long code can end up being counterproductive. In these cases, a smaller code length might be preferable. In fact, on this kind of data LZW encoding may end up increasing the size, and thus not being suitable at all.
135
+
136
+ ### Binary vs textual
137
+
138
+ The encoder and decoder may be run in either `binary` (default) or `textual` mode. Binary mode encodes the bytes of the string, whereas textual mode encodes its characters. Using binary mode with the (default) `BINARY` alphabet (of length 256) suffices to encode any arbitrary input, whereas with textual mode, one has to ensure the provided alphabet includes all the characters used in the input (this could prove tricky for arbitrary Unicode text). On the plus side, textual mode could attain higher compression for Unicode text, given that each character will be assigned a single code.
139
+
140
+ If in doubt, it is recommended to leave the default values for these settings (binary mode and binary alphabet), since they work out of the box for any input. Note that binary mode is *always* the default, **even** when one of the custom text alphabets is used. Thus, to enforce textual mode, it must be set explicitly with `binary: false`, and a suitable alphabet must also be specified (see [Custom alphabets](#custom-alphabets)).
141
+
142
+ The result of the encoding process is always returned as a binary string (i.e., `ASCII-8BIT`-encoded), whereas the result of the decoding process is either a binary string or a Unicode string (i.e., `UTF-8`-encoded), depending on which mode was selected - binary or textual - respectively.
143
+
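The difference between the two modes comes down to which symbols the encoder sees. The snippet below uses only core Ruby string methods to show the two views of the same Unicode string; the variable names are illustrative.

```ruby
text = "\u00F1and\u00FA" # "ñandú" as a UTF-8 string

# Binary mode works on bytes: multi-byte characters are split up.
binary_symbols = text.b.each_char.to_a

# Textual mode works on characters: each one is a single symbol.
textual_symbols = text.each_char.to_a

puts "bytes: #{binary_symbols.size}, characters: #{textual_symbols.size}"
```

Since the two accented letters take two bytes each in UTF-8, the byte view is longer than the character view, which is why textual mode needs an alphabet that contains every character in the input.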
144
+ ### Code length
145
+
146
+ The different code length parameters specify how many bits are used when packing each of the codes generated by the LZW algorithm into the final output string during the encoding process.
147
+
148
+ The encoder and the decoder need to agree on these values beforehand, otherwise, the decoder will just see garbled binary data when trying to read an encoded stream. This does not pose a problem when using this gem, since the configuration of the `LZWrb` object can only be done during initialization, so as long as both encoding and decoding are done with the same object, the configuration will match.
149
+
150
+ Classically, two different modes are common, *constant length* and *variable length*.
151
+
152
+ - Constant length mode uses the same bit size for all generated codes, and can be configured by setting the `:bits` parameter to whatever positive integer is desired. It has the disadvantage that there are either too few codes available (if the length is small), leading to poor compression and performance (due to the table needing to be refreshed often); or if the length is big to prevent these issues, then all codes take up substantial storage, again potentially leading to poor compression ratios.
153
+ - Variable length mode was introduced to mitigate these issues. It will start with a smaller code size, resulting in more efficiently packed codes, and it will progressively increase the length as required when the table gets full, until eventually reaching a maximum size that will trigger a table refresh, as is the case for the previous mode. To configure this mode, the parameters `:min_bits` and `:max_bits` are provided.
154
+
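To make the trade-off between the two modes concrete, here is a rough sketch of how the code width could follow the table size in variable length mode. The helper `code_width` is hypothetical, and the exact moment at which a real encoder switches widths may differ from this simplification.

```ruby
# Width in bits needed to write any code of a table with `table_size` entries,
# kept within the configured [min_bits, max_bits] range.
def code_width(table_size, min_bits, max_bits)
  needed = (table_size - 1).bit_length # largest code is table_size - 1
  [[needed, min_bits].max, max_bits].min
end

# Total bits to emit 1000 codes, assuming the table starts with 256 entries
# and grows by one entry per emitted code (a simplification).
variable_total = (0...1000).sum { |i| code_width(256 + i, 8, 16) }
constant_total = 1000 * 16

puts "variable: #{variable_total} bits, constant: #{constant_total} bits"
```

Early codes are written with few bits, so the variable scheme spends fewer bits overall on short inputs than a constant 16-bit scheme.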
155
+ If parameters for both modes are provided, then the constant `:bits` will take priority. Variable length mode typically provides a better compression ratio for most data samples, though the specific optimal values depend on the nature of the data, and experimentation may be needed to find them. On the other hand, a constant length of 8, 16, 32 or 64 bits provides the fastest speed, since all codes are then aligned with byte boundaries and code packing can be performed with native byte operations.
156
+
157
+ The LZW algorithm requires that the code table be initialized with all 1 character strings first, i.e., the [alphabet](#custom-alphabets) must be included in the table. Therefore, the maximum code length must be big enough to hold the alphabet, and it is also irrelevant if the minimum code size is chosen to be smaller than the alphabet's size, as the first codes will nevertheless take up as many bits as required. This will all be corrected by the gem if the user inputs incorrect code lengths.
158
+
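The correction described above can be pictured as follows. This is a sketch under the assumption that only the alphabet size matters (special codes are ignored); `bits_for` and `clamp_code_lengths` are hypothetical helpers, not the gem's API.

```ruby
# Smallest number of bits able to represent `symbol_count` distinct codes.
def bits_for(symbol_count)
  [(symbol_count - 1).bit_length, 1].max
end

# Raise the requested lengths so the alphabet always fits in the table.
def clamp_code_lengths(alphabet_size, min_bits, max_bits)
  min = [min_bits, bits_for(alphabet_size)].max
  max = [max_bits, min].max
  [min, max]
end

puts bits_for(256)                         # default binary alphabet
puts bits_for(26)                          # upper-case latin letters
puts clamp_code_lengths(26, 3, 16).inspect # too-small minimum gets raised
```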
159
+ It is recommended that the maximum code length be at least several bits higher than the number of bits required by the alphabet. For example, the default alphabet (binary) requires 8 bits (so that all 256 possible byte values can be encoded), and thus the maximum length should be several bits higher than 8. Otherwise, there will be a significant compression penalty, as well as a performance penalty, because the table will need to be refreshed very often due to the lack of slots. It is also pointless in this case to set the minimum code length below 8 bits.
160
+
161
+ The presets provide good ball-park figures for reasonable code lengths, ranging from 8 to 16 bits for arbitrary binary data. If a smaller alphabet is selected, say, a 5 bit one (32 entries, enough to fit all upper-case latin letters, for instance), then smaller lengths could be reasonable as well.
162
+
163
+ ### Packing order
164
+
165
+ As explained, codes can have arbitrary bit lengths, so they are generally not aligned with byte boundaries and may span multiple bytes. It is therefore important to select how to pack the bits of each code into the output string. Two standard schemes are common:
166
+
167
+ - *Least significant bit* (LSB) packing order will align the least significant bit of the code with the least significant bit available in the last byte of the output, leaving the rest, most significant bits, for the subsequent bytes.
168
+ - *Most significant bit* (MSB) packing order will align the most significant bit of the code with the most significant bit available in the last byte of the output, leaving the rest, least significant bits, for subsequent bytes.
169
+
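The two packing orders can be compared with a small sketch. Both methods below are illustrative helpers written for this explanation (the gem itself only implements LSB order), and they assume a constant code width for simplicity.

```ruby
# LSB order: each code fills the lowest free bits of the current byte first.
def pack_lsb(codes, width)
  buffer = 0
  count = 0
  bytes = []
  codes.each do |code|
    buffer |= code << count
    count += width
    while count >= 8
      bytes << (buffer & 0xFF)
      buffer >>= 8
      count -= 8
    end
  end
  bytes << (buffer & 0xFF) if count > 0 # zero-padded final byte
  bytes.pack('C*')
end

# MSB order: each code fills the highest free bits of the current byte first.
def pack_msb(codes, width)
  buffer = 0
  count = 0
  bytes = []
  codes.each do |code|
    buffer = (buffer << width) | code
    count += width
    while count >= 8
      count -= 8
      bytes << ((buffer >> count) & 0xFF)
    end
    buffer &= (1 << count) - 1          # keep only the unwritten bits
  end
  bytes << ((buffer << (8 - count)) & 0xFF) if count > 0
  bytes.pack('C*')
end

lsb = pack_lsb([0xABC], 12).bytes # => [0xBC, 0x0A]
msb = pack_msb([0xABC], 12).bytes # => [0xAB, 0xC0]
```

The same 12-bit code produces different byte sequences in each order, so a decoder reading with the wrong order would see unrelated codes.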
170
+ Currently, only LSB packing order is implemented, and thus, the `:lsb` configuration parameter is, for now, a placeholder, with the feature being planned.
171
+
172
+ Both packing orders are commonplace, with LSB being used in formats like GIF and utilities like UNIX compress, and MSB being used in formats like PDF and TIFF, for instance.
173
+
174
+ ### Clear & Stop codes
175
+
176
+ Another common feature of the LZW algorithm is the addition of special codes, not corresponding to any specific pattern of the data but, instead, to certain control instructions for the decoder.
177
+
178
+ - The **CLEAR** code specifies to the decoder that the code table should be refreshed immediately. This feature is not technically needed, since the encoder and decoder must agree on all parameters anyways, and therefore are "synchronized", which means they will both refresh the table at the exact same time (whenever it's needed due to the table becoming full).
179
+
180
+ However, this feature allows an intelligent encoder (which this one is *not*) to detect when the current table no longer properly reflects the structure of the data, and would benefit from refreshing the table and getting rid of the old codes, thus also reducing the code size (if variable length mode is used) and boosting the performance (codes are found sooner on smaller tables). This encoding feature could be particularly useful for data with significant pattern changes.
181
+
182
+ The encoder/decoder can be configured to use clear codes with the `:clear` option. Since the encoder is not intelligent, this option simply means that the encoder will output a clear code before refreshing the table when it's full, thus adding a little overhead. It is therefore only useful for compatibility with formats that require the usage of clear codes, like GIF.
183
+
184
+ - The **STOP** code indicates the end of the data, and that the algorithm should therefore terminate. It is only necessary in two scenarios:
185
+
186
+ - If the code length is smaller than 8 bits. In this case, a long string of 0's at the end of the final byte is ambiguous, as it could either be just padding, or actually represent code 0. Therefore, the encoder will force stop codes when the minimum code size is below 8 bits.
187
+ - For compatibility with formats that require stop codes, like GIF.
188
+
189
+   The encoder/decoder can be configured to use stop codes with the `:stop` option. As mentioned, stop codes might be used even if this option is not set, or set to false, if the encoder deems it necessary (a warning will be issued in those cases).
190
+
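The padding ambiguity that makes Stop codes necessary for short codes can be reproduced directly. The helper `pack_lsb_codes` below is a hypothetical, simplified LSB packer built on `Array#pack` with the `b*` directive; it is not the gem's packing routine.

```ruby
# Pack codes of a fixed bit width in LSB order; the last byte is zero-padded.
def pack_lsb_codes(codes, width)
  bits = codes.map { |code| code.to_s(2).rjust(width, '0').reverse }.join
  [bits].pack('b*') # 'b' stores bits in ascending order within each byte
end

one_code  = pack_lsb_codes([5], 3)    # a single 3-bit code
two_codes = pack_lsb_codes([5, 0], 3) # the same code followed by code 0

puts one_code == two_codes
```

Both inputs produce the same single byte, so without a Stop code (or a known length) the decoder cannot tell padding apart from a trailing code 0.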
191
+ Another relevant option is **deferred clear codes**. An intelligent encoder can choose not to refresh the code table even when it's full, thus relying on the current one, without adding any new codes, until it deems it useful, at which point it will output a clear code as a signal for the decoder. A decoder would then have to be configured accordingly, such that it doesn't refresh the code table automatically - which is the standard behaviour - unless it receives an explicit clear code.
192
+
193
+ This feature can be set using the `:deferred` option during initialization. It will only affect the decoder, since the encoder is not intelligent and will therefore never choose to output clear codes prematurely nor in deferred fashion, but rather, precisely when the code table is full. Nevertheless, it is necessary if one wants to decode data that was encoded with this feature on.
194
+
195
+ Some formats, like GIF, employ this feature. In fact, it caused a great deal of confusion among developers, so much so that the cover sheet of the GIF89a specification is devoted to explaining it. To this day, many pieces of software cannot properly decode GIF files that use deferred clear codes, since their decoder automatically refreshes the code table whenever it's full.
196
+
197
+ ### Custom alphabets
198
+
199
+ As mentioned before, this algorithm can work for arbitrary data composed of characters from any character set, or *alphabet*. By default, this gem will use a binary alphabet of size 256 containing all possible byte values, which, therefore, is suitable to encode and decode any arbitrary piece of data.
200
+
201
+ It is the recommended alphabet to use if in doubt or if binary data is meant to be encoded. Even if the data to be encoded is just text, if it is Unicode it is probably still best to use the default binary alphabet.
202
+
203
+ Nevertheless, if data composed of a substantially smaller set of symbols is meant to be encoded, the full alphabet might be overkill, and a smaller alphabet could be better suited, leading to better compression due to the potential to use smaller code lengths. Unicode text can also benefit from a custom alphabet which includes all characters used in it, provided textual mode is used, since in that case, each character will be assigned a single compression code, rather than up to 4.
204
+
205
+ This can be configured with the `:alphabet` option of the constructor, which receives an array with all the characters the data is (supposedly) composed of.
206
+
207
+ For example, if the data to be encoded is composed only of bytes 0 through 7, the array
208
+ ```ruby
209
+ ["\x00", "\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07"]
210
+ ```
211
+ could be used as the alphabet, and a minimum code length of only 3 bits could be set. The gem comes equipped with a few default alphabets as constants, for instance, `HEX_UPPER` contains all 16 possible hex digits in uppercase, and the default alphabet, `BINARY`, contains all possible 256 byte values. See the `LZWrb` class for a full list. Nevertheless, an arbitrary array can be used here, as long as it's composed only of 1-character strings.
212
+
213
+ Sample input:
214
+ ```ruby
215
+ lzw = LZWrb.new(alphabet: LZWrb::LATIN_UPPER)
216
+ data = 'TOBEORNOTTOBEORTOBEORNOT' * 10000
217
+ puts lzw.decode(lzw.encode(data)) == data
218
+ ```
219
+
220
+ Output:
221
+ ```
222
+ [09:48:25.765] LZW <- Encoding 234.375KiB with 5-16 bit codes, LSB packing, STOP codes, textual mode.
223
+ [09:48:25.899] LZW -> Encoding finished in 0.133s (avg. 13.717 mbit/s).
224
+ [09:48:25.899] LZW -> Encoded data: 4.330KiB (98.15% compression).
225
+ [09:48:25.899] LZW <- Decoding 4.330KiB with 5-16 bit codes, LSB packing, STOP codes, textual mode.
226
+ [09:48:25.909] LZW -> Decoding finished in 0.010s (avg. 3.272 mbit/s).
227
+ [09:48:25.909] LZW -> Decoded data: 234.375KiB (98.15% compression).
228
+ true
229
+ ```
230
+
231
+ Note how the encoder automatically selected a minimum code size of 5 bits, which suffices to hold the specified alphabet. Specifying fewer minimum bits doesn't make sense, and will be silently corrected by the encoder as well.
232
+
233
+ Note that if the provided alphabet does not contain all the symbols that compose the data to be encoded, this will result in unexpected and incorrect behaviour; most likely an exception, although in some rare cases it could just result in garbled data. Therefore, the user should do either of the following:
234
+
235
+ - Ensure the data is composed solely of characters from the alphabet. This will never be a problem if the default alphabet (binary) is used, since that can be used to encode arbitrary data. As mentioned, this is the recommended option unless the data can clearly benefit from a smaller alphabet.
236
+ - Use the `:safe` option in the constructor, which will always perform a first pass before encoding to ensure that the provided data is indeed composed only of characters from the alphabet. This option, naturally, incurs a performance penalty for the encoding process (see [Benchmarks](#benchmarks)) that depends on the length of the data. Thus, whenever speed is crucial, the first option is preferable.
237
+
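For the first option, the check can also be done by the caller before encoding. The following sketch shows one way to do so; `outside_alphabet` is a hypothetical helper, and the gem's own `:safe` pass may be implemented differently.

```ruby
require 'set'

# Returns the distinct characters of `data` that are missing from `alphabet`.
# An empty result means the data can be encoded with that alphabet.
def outside_alphabet(data, alphabet)
  allowed = alphabet.to_set
  data.each_char.reject { |ch| allowed.include?(ch) }.uniq
end

latin_upper = ('A'..'Z').to_a
puts outside_alphabet('TOBEORNOT', latin_upper).inspect
puts outside_alphabet('TO BE!', latin_upper).inspect
```

Like the `:safe` option, this requires a full pass over the data, so its cost grows with the input length.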
238
+ ### Verbosity
239
+
240
+ The amount of information logged to the terminal by the encoder and decoder can be set using the `:verbosity` option of the constructor, which takes any of the following symbols:
241
+
242
+ Value | Description
243
+ --- | ---
244
+ `:silent` | Don't log anything.
245
+ `:minimal` | Log only errors.
246
+ `:quiet` | Log errors and warnings.
247
+ `:normal` | Also log brief encoding/decoding info (time, compression ratio, speed...).
248
+ `:debug` | Also log additional debug information of the encoding / decoding process, such as when the code table gets refreshed, or the traces of the exceptions, if any.
249
+
250
+ The default value is `:normal`.
251
+
252
+ ## Benchmarks
253
+
254
+ On mid-range computers (2nd gen Ryzen 5, 7th gen i5) I've obtained the following average speeds:
255
+
256
+ - A low bound for encoding speeds of 5 mbit/s when the data is random, and thus, very hard to encode.
257
+ - A high bound for encoding speeds of 16 mbit/s when the data is very regular, and thus, easy to encode.
258
+ - A fairly consistent decoding speed of 8-9 mbit/s, regardless of the nature of the original data.
259
+
260
+ The optional (and by default disabled) `:safe` setting, which performs a first pass in the encoding process to ensure data integrity (see [Custom alphabets](#custom-alphabets)), seems to incur roughly a 15% performance penalty, i.e., encoding bitrate is decreased by about 15%. This setting does not affect the decoding process.
261
+
262
+ ## Todo
263
+
264
+ Eventually I'd like to have the following extra features tested and implemented:
265
+
266
+ ### Main features
267
+
268
+ * Most significant bit (MSB) packing order.
269
+ * "Early change" support.
270
+ * Support for TIFF and PDF (requires the 2 features above).
271
+ * Support for UNIX compress (has a bug).
272
+
273
+ ### Smaller additions
274
+
275
+ * Option to disable automatic table reinitialization.
276
+ * Choice between plain and rich logs.
277
+
278
+ ### Optimizations
279
+
280
+ * Native packing for codes of constant width of 8, 16, 32 or 64 bits for extra speed.
281
+ * Try using a Trie instead of a Hash for the encoding process.
282
+
283
+ ## Notes
284
+
285
+ * **Concurrency**: The `LZWrb` objects are _not_ thread safe. If you want to encode/decode in parallel, you must create a separate object for each thread, even if they use the same exact configuration.
286
+ * **Custom alphabets**: If you change the default alphabet (binary), ensure the data to be encoded can be expressed solely with that alphabet, or alternatively, specify the `safe` option to the encoder.
data/lib/lzwrb.rb CHANGED
@@ -1,19 +1,41 @@
1
+ # Holds an LZW encoder/decoder with a specific configuration. These objects are not thread safe, so for
2
+ # concurrent use different ones should be created.
1
3
  class LZWrb
2
4
 
3
- # Default alphabets
5
+ # Alphabet containing the digits 0 to 9
4
6
  DEC = ('0'..'9').to_a
7
+
8
+ # Alphabet containing the hex digits 0 to F in uppercase
5
9
  HEX_UPPER = (0...16).to_a.map{ |n| n.to_s(16).upcase }
10
+
11
+ # Alphabet containing the hex digits 0 to f in lower case
6
12
  HEX_LOWER = (0...16).to_a.map{ |n| n.to_s(16).downcase }
13
+
14
+ # Alphabet containing the 26 letters of the latin alphabet in upper case
7
15
  LATIN_UPPER = ('A'..'Z').to_a
16
+
17
+ # Alphabet containing the 26 letters of the latin alphabet in lower case
8
18
  LATIN_LOWER = ('a'..'z').to_a
19
+
20
+ # Alphabet containing the alphanumeric characters in upper case (A-Z, 0-9)
9
21
  ALPHA_UPPER = LATIN_UPPER + DEC
22
+
23
+ # Alphabet containing the alphanumeric characters in lower case (a-z, 0-9)
10
24
  ALPHA_LOWER = LATIN_LOWER + DEC
25
+
26
+ # Alphabet containing the alphanumeric characters (A-Z, a-z, 0-9)
11
27
  ALPHA = LATIN_UPPER + LATIN_LOWER + DEC
28
+
29
+ # Alphabet containing all printable ASCII characters (ASCII 32 - 126)
12
30
  PRINTABLE = (32...127).to_a.map(&:chr)
31
+
32
+ # Alphabet containing all ASCII characters
13
33
  ASCII = (0...128).to_a.map(&:chr)
34
+
35
+ # Alphabet containing all possible byte values, suitable for any input data
14
36
  BINARY = (0...256).to_a.map(&:chr)
15
37
 
16
- # Default presets
38
+ # Preset to satisfy the GIF specification
17
39
  PRESET_GIF = {
18
40
  min_bits: 8,
19
41
  max_bits: 12,
@@ -22,6 +44,8 @@ class LZWrb
22
44
  stop: true,
23
45
  deferred: true
24
46
  }
47
+
48
+ # Preset optimized for speed
25
49
  PRESET_FAST = {
26
50
  min_bits: 16,
27
51
  max_bits: 16,
@@ -29,6 +53,8 @@ class LZWrb
29
53
  clear: false,
30
54
  stop: false
31
55
  }
56
+
57
+ # Preset optimized for compression
32
58
  PRESET_BEST = {
33
59
  min_bits: 8,
34
60
  max_bits: 16,
@@ -46,27 +72,54 @@ class LZWrb
46
72
  debug: 4 # Print everything, including debug details about the encoding process
47
73
  }
48
74
 
49
- # Class default values (no NIL's here!)
50
- @@min_bits = 8 # Minimum code bit length
51
- @@max_bits = 16 # Maximum code bit length before rebuilding table
52
- @@lsb = true # Least significant bit first order
53
- @@clear = false # Use CLEAR codes
54
- @@stop = false # Use STOP codes
55
- @@deferred = false # Use deferred CLEAR codes
56
-
75
+ # Default value for minimum code bits (may be changed depending on alphabet)
76
+ DEFAULT_MIN_BITS = 8
77
+
78
+ # Default value for maximum code bits before rebuilding code table (may be changed depending on alphabet)
79
+ DEFAULT_MAX_BITS = 16
80
+
81
+ # Use Least Significant Bit packing order
82
+ DEFAULT_LSB = true
83
+
84
+ # Use explicit Clear codes to indicate the initialization of the code table
85
+ DEFAULT_CLEAR = false
86
+
87
+ # Use explicit Stop codes to indicate the end of the data
88
+ DEFAULT_STOP = false
89
+
90
+ # Enable the use of deferred Clear codes (the decoder won't rebuild the table, even when full, unless an explicit Clear code is received)
91
+ DEFAULT_DEFERRED = false
92
+
93
+ # Dictionary of binary chars for fast access
94
+ CHARS = 256.times.map(&:chr)
95
+
96
+ # Creates a new encoder/decoder object with the given settings.
97
+ # @param alphabet [Array<String>] Set of characters that compose the messages to encode.
98
+ # @param binary [Boolean] Use binary encoding or textual encoding.
99
+ # @param bits [Integer] Code bit size for constant length encoding (supersedes min/max bit size).
100
+ # @param clear [Boolean] Use clear codes every time the table gets reinitialized.
101
+ # @param deferred [Boolean] Support deferred clear codes when decoding (i.e., don't refresh code table unless an explicit clear code is received, even when it's full).
102
+ # @param lsb [Boolean] Use least or most significant bit packing (currently useless, only LSB supported).
103
+ # @param max_bits [Integer] Maximum code bit size for variable length encoding (superseded by `bits`).
104
+ # @param min_bits [Integer] Minimum code bit size for variable length encoding (superseded by `bits`).
105
+ # @param preset [Hash] Predefined configurations for a few settings (such as bit count or usage of clear/stop codes)
106
+ # @param safe [Boolean] First encoding pass to verify alphabet covers all data
107
+ # @param stop [Boolean] Use stop codes to denote the end of the encoded data.
108
+ # @param verbosity [Integer] Verbosity level of the encoder (see {LZWrb::VERBOSITY}).
109
+ # @return [LZWrb] The newly created encoder/decoder.
57
110
  def initialize(
58
- preset: nil, # Predefined configurations (GIF...)
59
- bits: nil, # Code bit size for constant length encoding (superseeds min/max bit size)
60
- min_bits: nil, # Minimum code bit size for variable length encoding (superseeded by 'bits')
61
- max_bits: nil, # Maximum code bit size for variable length encoding (superseeded by 'bits')
62
- binary: nil, # Use binary encoding (vs regular text encoding)
63
- alphabet: BINARY, # Set of characters that compose the messages to encode
64
- safe: false, # First encoding pass to verify alphabet covers all data
65
- lsb: nil, # Use least or most significant bit packing
66
- clear: nil, # Use clear codes every time the table gets reinitialized
67
- stop: nil, # Use stop codes at the end of the encoding
68
- deferred: nil, # Use deferred clear codes
69
- verbosity: :normal # Verbosity level of the encoder
111
+ preset: nil,
112
+ bits: nil,
113
+ min_bits: nil,
114
+ max_bits: nil,
115
+ binary: nil,
116
+ alphabet: BINARY,
117
+ safe: false,
118
+ lsb: nil,
119
+ clear: nil,
120
+ stop: nil,
121
+ deferred: nil,
122
+ verbosity: :normal
70
123
  )
71
124
  # Parse preset
72
125
  params = preset || {}
@@ -88,7 +141,8 @@ class LZWrb
88
141
  warn('Removed duplicate entries from alphabet') if @alphabet.size < alphabet.size
89
142
 
90
143
  # Binary compression
91
- @binary = binary.nil? ? alphabet == BINARY : binary
144
+ @binary = binary == false ? false : true
145
+ warn("Binary alphabet being used with textual mode, are you sure this is what you want?") if !@binary && @alphabet == BINARY
92
146
 
93
147
  # Safe mode for encoding (verifies that the data provided is composed exclusively
94
148
  # by characters from the alphabet)
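The hunk above changes how the mode is chosen when `binary:` is not passed. The sketch below contrasts the two rules in isolation. `binary_mode_old` and `binary_mode_new` are hypothetical helper names, not part of the gem, and `BINARY` is assumed here to be the 256 single-byte strings.

```ruby
# Assumed stand-in for the library's BINARY alphabet (all 256 byte values).
BINARY = 256.times.map(&:chr)

# Rule removed in this hunk: when `binary` is nil, infer the mode from the alphabet.
def binary_mode_old(binary, alphabet)
  binary.nil? ? alphabet == BINARY : binary
end

# Rule added in this hunk: anything other than an explicit `false` means binary mode.
def binary_mode_new(binary)
  binary == false ? false : true
end

hex = '0123456789ABCDEF'.chars  # a custom, non-binary alphabet

p binary_mode_old(nil, hex)  # false: custom alphabet implied textual mode
p binary_mode_new(nil)       # true: binary unless explicitly disabled
```

Under this reading, a caller that passed a custom alphabet and relied on textual mode being inferred would now need to pass `binary: false` explicitly.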
@@ -104,8 +158,8 @@ class LZWrb
104
158
  @max_bits = bits
105
159
  end
106
160
  else
107
- @min_bits = find_arg(min_bits, params[:min_bits], @@min_bits)
108
- @max_bits = find_arg(max_bits, params[:max_bits], @@max_bits)
161
+ @min_bits = find_arg(min_bits, params[:min_bits], DEFAULT_MIN_BITS)
162
+ @max_bits = find_arg(max_bits, params[:max_bits], DEFAULT_MAX_BITS)
109
163
  if @max_bits < @min_bits
110
164
  warn("Max code size (#{@max_bits}) should be higher than min code size (#{@min_bits}): changed max code size to #{@min_bits}.")
111
165
  @max_bits = @min_bits
@@ -119,8 +173,8 @@ class LZWrb
119
173
  end
120
174
 
121
175
  # Clear and stop codes
122
- use_clear = find_arg(clear, params[:clear], @@clear)
123
- use_stop = find_arg(stop, params[:stop], @@stop)
176
+ use_clear = find_arg(clear, params[:clear], DEFAULT_CLEAR)
177
+ use_stop = find_arg(stop, params[:stop], DEFAULT_STOP)
124
178
  if !use_stop && @min_bits < 8
125
179
  use_stop = true
126
180
  # Warning if stop codes were explicitly disabled (false, NOT nil)
@@ -150,12 +204,15 @@ class LZWrb
150
204
  idx = @alphabet.size - 1
151
205
  @clear = use_clear ? idx += 1 : nil
152
206
  @stop = use_stop ? idx += 1 : nil
153
- @deferred = find_arg(deferred, params[:deferred], @@deferred)
207
+ @deferred = find_arg(deferred, params[:deferred], DEFAULT_DEFERRED)
154
208
 
155
209
  # Least/most significant bit packing order
156
- @lsb = find_arg(lsb, params[:lsb], @@lsb)
210
+ @lsb = find_arg(lsb, params[:lsb], DEFAULT_LSB)
157
211
  end
158
212
 
213
+ # Encode the provided data.
214
+ # @param data [String] Data to encode.
215
+ # @return [String] Encoded data.
159
216
  def encode(data)
160
217
  # Log
161
218
  log("<- Encoding #{format_size(data.bytesize)} with #{format_params}.")
@@ -167,9 +224,11 @@ class LZWrb
167
224
  verify_data(data) if @safe
168
225
 
169
226
  # LZW-encode data
170
- buf = ''
227
+ buf = ''.b
171
228
  put_code(@clear) if !@clear.nil?
172
- data.each_char do |c|
229
+ proc = -> (c) {
230
+ c = CHARS[c] if @binary
231
+ @count += 1
173
232
  next_buf = buf + c
174
233
  if table_has(next_buf)
175
234
  buf = next_buf
@@ -179,7 +238,8 @@ class LZWrb
179
238
  table_check()
180
239
  buf = c
181
240
  end
182
- end
241
+ }
242
+ @binary ? data.each_byte(&proc) : data.each_char(&proc)
183
243
  put_code(@table[buf])
184
244
  put_code(@stop) if !@stop.nil?
185
245
 
@@ -195,7 +255,9 @@ class LZWrb
195
255
  lex(e, 'Encoding error', true)
196
256
  end
197
257
 
198
- # Optimization? Unpack bits subsequently, rather than converting between strings and ints
258
+ # Decode the provided data.
259
+ # @param data [String] Data to decode.
260
+ # @return [String] Decoded data.
199
261
  def decode(data)
200
262
  # Log
201
263
  log("<- Decoding #{format_size(data.bytesize)} with #{format_params}.")
@@ -214,6 +276,7 @@ class LZWrb
214
276
  width = @bits
215
277
  while off + width <= len
216
278
  # Parse code
279
+ @count += 1
217
280
  code = bits[off ... off + width].reverse.to_i(2)
218
281
  off += width
219
282
 
@@ -246,7 +309,7 @@ class LZWrb
246
309
  ttime = Time.now - stime
247
310
  log("-> Decoding finished in #{"%.3fs" % [ttime]} (avg. #{"%.3f" % [(8.0 * data.bytesize / 1024 ** 2) / ttime]} mbit\/s).")
248
311
  log("-> Decoded data: #{format_size(out.bytesize)} (#{"%5.2f%%" % [100 * (1 - data.bytesize.to_f / out.bytesize)]} compression).")
249
- out
312
+ @binary ? out.force_encoding('ASCII-8BIT') : out.force_encoding('UTF-8')
250
313
  rescue => e
251
314
  lex(e, 'Decoding error', false)
252
315
  end
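The encode and decode hunks above switch between `each_byte` and `each_char` and tag the decoded output with an encoding. A minimal sketch of why that matters for non-ASCII input follows. `symbols` is a hypothetical helper, not the gem's API, and `CHARS` mirrors the constant added in this diff.

```ruby
# Table of single-byte strings, mirroring the CHARS constant added in this diff.
CHARS = 256.times.map(&:chr)

# Hypothetical helper: split the input into the symbols the encoder would iterate over.
def symbols(data, binary)
  binary ? data.each_byte.map { |b| CHARS[b] } : data.each_char.to_a
end

text = "a\u00F1b"            # "añb": 3 characters, 4 bytes in UTF-8

bin = symbols(text, true)    # one symbol per byte
txt = symbols(text, false)   # one symbol per character

p bin.size                   # 4
p txt.size                   # 3

# Rebuild the byte sequence in a binary buffer, then tag it the way the end of
# `decode` does: ASCII-8BIT in binary mode, UTF-8 in textual mode.
raw = bin.reduce(''.b) { |acc, s| acc << s }
p raw.force_encoding('ASCII-8BIT').encoding       # #<Encoding:ASCII-8BIT> (shown as BINARY on newer Rubies)
p raw.dup.force_encoding('UTF-8') == text         # true
```

In binary mode every symbol is a single byte, so a 256-entry alphabet always covers the input. In textual mode a multi-byte character is a single symbol and must be present in the alphabet.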
@@ -256,10 +319,11 @@ class LZWrb
256
319
  # Initialize buffers, needs to be called every time we execute a new
257
320
  # compression / decompression job
258
321
  def init(compress)
259
- @buffer = [] # Contains result of compression
260
- @boff = 0 # BIT offset of last buffer byte, for packing
261
- @compress = compress # Compression or decompression job
262
- @step = @compress ? 0 : 1 # Decoder is always 1 step behind the encoder
322
+ @count = 0 # Amount of codes generated
323
+ @buffer = [] # Contains result of compression
324
+ @boff = 0 # BIT offset of last buffer byte, for packing
325
+ @compress = compress # Compression or decompression job
326
+ @step = @compress ? 0 : 1 # Decoder is always 1 step behind the encoder
263
327
  end
264
328
 
265
329
  # < --------------------------- PARSING METHODS ---------------------------- >
@@ -326,6 +390,7 @@ class LZWrb
326
390
  end
327
391
 
328
392
  @bits = [@key.bit_length, @min_bits].max
393
+ dbg("Refreshed code table after #{@count} #{@compress ? 'characters' : 'codes'}.")
329
394
  end
330
395
 
331
396
  def table_has(val)
@@ -360,8 +425,13 @@ class LZWrb
360
425
  # < ------------------------- ENCODING METHODS --------------------------- >
361
426
 
362
427
  def verify_data(data)
363
- alph = @alphabet.each_with_index.to_h
364
- raise "Data contains characters not present in the alphabet" if data.each_char.any?{ |c| !alph.include?(c) }
428
+ data.send(@binary ? :bytes : :chars) do |c|
429
+ c = c.chr if @binary
430
+ if !@alphabet.include?(c)
431
+ c = c.force_encoding('UTF-8').scrub{ |b| '\x' + b.unpack('H*')[0].upcase } if @binary
432
+ raise "Data contains character not present in the alphabet: #{c}"
433
+ end
434
+ end
365
435
  end
366
436
 
367
437
  def put_code(code)
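The decoder hunk above parses each code with `bits[off ... off + width].reverse.to_i(2)`, which implies that codes are stored least significant bit first. The body of `put_code` is not shown in this diff, so the following is only a sketch of fixed-width LSB packing under that assumption. `pack_lsb` and `unpack_lsb` are hypothetical names, not the gem's methods.

```ruby
# Hypothetical helper: pack integer codes of a fixed bit width, least significant bit first.
def pack_lsb(codes, width)
  # Write each code as `width` binary digits, reversed so the lowest bit comes first.
  bits = codes.map { |c| c.to_s(2).rjust(width, '0').reverse }.join
  # 'b*' fills each byte starting from its lowest bit and zero-pads the final byte.
  [bits].pack('b*')
end

# Hypothetical helper: inverse of pack_lsb, using the same slicing as the decoder hunk.
def unpack_lsb(data, width)
  bits = data.unpack('b*')[0]
  codes = []
  off = 0
  while off + width <= bits.size
    codes << bits[off...off + width].reverse.to_i(2)
    off += width
  end
  codes
end

packed = pack_lsb([65, 256, 3], 9)
p packed.bytesize          # 4 (27 bits padded to 32)
p unpack_lsb(packed, 9)    # [65, 256, 3]

# With narrow codes, the zero padding can look like one more code:
p unpack_lsb(pack_lsb([5], 3), 3)   # [5, 0]
```

The last line suggests one plausible reason for the constructor forcing stop codes when `@min_bits < 8`: when codes are narrower than a byte, the padding in the final byte can be long enough to be read as a spurious code, and an explicit stop code removes that ambiguity.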
metadata CHANGED
@@ -1,20 +1,20 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: lzwrb
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - edelkas
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2024-02-02 00:00:00.000000000 Z
11
+ date: 2024-03-03 00:00:00.000000000 Z
12
12
  dependencies: []
13
13
  description: |2
14
14
  This library provides LZW encoding and decoding capabilities with no
15
- dependencies and reasonably fast speed. It is highly configurable,
15
+ dependencies and a reasonably fast speed. It is highly configurable,
16
16
  supporting both constant and variable code lengths, custom alphabets,
17
- usage of clear/stop codes...
17
+ usage of clear/stop codes... It uses LSB packing order.
18
18
 
19
19
  It is compatible with the GIF specification, and comes equipped with
20
20
  several presets. Eventually I'd like to add compatibility with other
@@ -22,13 +22,21 @@ description: |2
22
22
  email:
23
23
  executables: []
24
24
  extensions: []
25
- extra_rdoc_files: []
25
+ extra_rdoc_files:
26
+ - README.md
27
+ - CHANGELOG.md
26
28
  files:
29
+ - ".yardopts"
30
+ - CHANGELOG.md
31
+ - README.md
27
32
  - lib/lzwrb.rb
28
33
  homepage: https://github.com/edelkas/lzwrb
29
34
  licenses: []
30
35
  metadata:
36
+ homepage_uri: https://github.com/edelkas/lzwrb
31
37
  source_code_uri: https://github.com/edelkas/lzwrb
38
+ documentation_uri: https://www.rubydoc.info/gems/lzwrb
39
+ changelog_uri: https://github.com/edelkas/lzwrb/blob/master/CHANGELOG.md
32
40
  post_install_message:
33
41
  rdoc_options: []
34
42
  require_paths:
@@ -44,8 +52,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
44
52
  - !ruby/object:Gem::Version
45
53
  version: '0'
46
54
  requirements: []
47
- rubygems_version: 3.1.6
55
+ rubygems_version: 3.1.2
48
56
  signing_key:
49
57
  specification_version: 4
50
- summary: Pury Ruby LZW encoder/decoder with a wide range of settings
58
+ summary: Pure Ruby LZW encoder/decoder, highly configurable, compatible with GIF
51
59
  test_files: []