RubyGems - lzwrb - Versions diffs - 0.1.0 → 0.2.1 - Mend

lzwrb 0.1.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3ca54b9ac48840e65fc5d307707491fc3f06dae167a62fb14e1700877a169cc8
-  data.tar.gz: ca28f83f51533b04d093f79cf16f4a23cb02a646d496e48ef7f98802128290f8
+  metadata.gz: f77bcc2ae9deef70e973dff0baa710fa75b9bfd0d7c1e5ee5e2186f0c5fed787
+  data.tar.gz: 4cb924a627c753410d39fac4796fca9f818e1c99d04c651b13f74aebb79a9108
 SHA512:
-  metadata.gz: 0a146638c2720f8cc5b2e6d8af24f461ddad63d09eb4d86660897a73f7832cdf1439ef53aeb9e52754a536e5d36b5f4799b488864372195a0b5c47154db76b86
-  data.tar.gz: 663e70084a893fa73b210ead73e91656f16753f00da72ac00e7f1474e861f65f98b2b81e048959daa8938c0a227c64338f14c463526589573b4ccaad163da079
+  metadata.gz: 46f6c747615ceb721167b3a7643787ad69c345371ede65263cf62cf962b73ffe99f9a4671d289ec001368f084323c9be6cd5466cc522af6dcf14537fac904075
+  data.tar.gz: 16c6601c4cda01ad7294c77770f8236b0b2391c043a2d66e08b72061d4bace70d226d86c2862614d9234b170d15a2a3c57a8d7f6e7c6eab55c33dee248eec1b3

data/.yardopts ADDED

@@ -0,0 +1 @@
+lib/*/.rb - CHANGELOG.md

data/CHANGELOG.md ADDED

@@ -0,0 +1,27 @@
+# Changelog
+### 0.2.1 (03/Feb/2024)
+Documentation fix.
+### 0.2.0 (03/Feb/2024)
+- Adds: Binary vs textual encoding modes.
+- Adds: Rake tests.
+- Adds: YARD docs.
+- Adds: Debug logging.
+- Fixes: Encoding of Unicode strings in binary mode.
+- Fixes: Safe guard.
+### 0.1.0 (02/Feb/2024)
+Initial release. Already includes:
+- Working encoding and decoding.
+- Constant and variable code length support.
+- Custom alphabet support.
+- Optional Clear/Stop codes support, including deferred Clear codes.
+- Only Least Significant Bit packing order.
+- Full GIF specs support.
+- Presets.

data/README.md ADDED

@@ -0,0 +1,285 @@
+# lzwrb
+[![Gem Version](https://img.shields.io/gem/v/lzwrb.svg)](https://rubygems.org/gems/lzwrb)
+[![Gem](https://img.shields.io/gem/dt/lzwrb.svg)](https://rubygems.org/gems/lzwrb)
+[![Documentation](https://img.shields.io/badge/docs-grey.svg)](https://www.rubydoc.info/gems/lzwrb/)
+lzwrb is a Ruby gem for LZW encoding and decoding. Main features:
+* Pure Ruby, no dependencies.
+* Highly configurable (constant/variable code length, bit packing order...). See [Configuration](#configuration).
+* Compatible with many LZW specifications, including GIF (see [Presets](#presets)).
+* Reasonably fast for pure Ruby (see [Benchmarks](#benchmarks)).
+See also: [Documentation](https://www.rubydoc.info/gems/lzwrb/), [Changelog](https://www.rubydoc.info/gems/lzwrb/file/CHANGELOG.md).
+## Table of contents
+- [What's LZW?](#what's-lzw)
+- [Installation](#installation)
+- [Basic Usage](#basic-usage)
+- [Configuration](#configuration)
+  - [Presets](#presets)
+  - [Binary vs textual](#binary-vs-textual)
+  - [Code length](#code-length)
+  - [Packing order](#packing-order)
+  - [Clear & Stop codes](#clear--stop-codes)
+  - [Custom alphabets](#custom-alphabets)
+  - [Verbosity](#verbosity)
+- [Benchmarks](#benchmarks)
+- [Todo](#todo)
+- [Notes](#notes)
+## What's LZW?
+In short, LZW is a dictionary-based lossless compression algorithm. The encoder scans the data sequentially, building a table of patterns and substituting them in the data by their corresponding code. The decoder, configured with the same settings, will recognize the same patterns and generate the same table, thus being able to recover the original data from the sequence of codes.
+The original algorithm used constant code lengths (typically 12 bits), but this is wasteful, especially for data that can be represented with significantly fewer bits, so variable code lengths were introduced. Another common feature is the use of special _Clear_ and _Stop_ codes: when the table is full, the encoder and decoder have to reinitialize it, but this can also be done at will using the Clear code. Likewise, the Stop code indicates the end of the data, which may actually be necessary to avoid ambiguity (see [Clear & Stop codes](#clear--stop-codes)). See [Configuration](#configuration) for a more in-depth description of these and other features, and refer to the [Wikipedia article](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch) for a more thorough explanation.
+Nowadays, LZW has largely been replaced by other algorithms that provide a better compression ratio, notably [Deflate](https://en.wikipedia.org/wiki/Deflate), implemented by the ubiquitous [zlib](https://en.wikipedia.org/wiki/Zlib) and formats like Zip or GZip, and used in a plethora of file formats, like PNG images. Deflate also provides a faster decoding speed, though encoding is usually slower. Nevertheless, LZW still has its place in more classic formats yet in use, like GIF and TIFF images, PDF files, or the classic UNIX compress utility.
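The table-building idea described in this section can be illustrated with a minimal sketch. It is not the gem's implementation: the method names `lzw_encode` and `lzw_decode` are hypothetical, codes are kept as plain integers (no bit packing), and the table is never reset, so it only shows how encoder and decoder rebuild the same table independently.

```ruby
# Minimal LZW sketch over bytes. Hypothetical helpers, not the gem's API:
# codes stay as integers and the table grows without bound.
def lzw_encode(data)
  table = {}
  256.times { |i| table[i.chr.b] = i }  # seed with every 1-byte string
  buf = ''.b
  codes = []
  data.b.each_char do |c|
    candidate = buf + c
    if table.key?(candidate)
      buf = candidate                   # keep extending the current pattern
    else
      codes << table[buf]               # emit the longest known pattern
      table[candidate] = table.size     # learn the new, longer pattern
      buf = c
    end
  end
  codes << table[buf] unless buf.empty?
  codes
end

def lzw_decode(codes)
  return ''.b if codes.empty?
  table = Array.new(256) { |i| i.chr.b }
  prev = table[codes.first]
  out = prev.dup
  codes.drop(1).each do |code|
    # The only code that may be missing is the one about to be defined,
    # in which case it must be the previous pattern plus its own first byte.
    entry = code < table.size ? table[code] : prev + prev[0]
    out << entry
    table << prev + entry[0]            # mirror the encoder's table growth
    prev = entry
  end
  out
end

sample = 'TOBEORNOTTOBEORTOBEORNOT'
codes = lzw_encode(sample)
puts lzw_decode(codes) == sample.b      # true
puts codes.size < sample.bytesize       # true: fewer codes than input bytes
```

A real encoder additionally has to pack these integer codes into bytes and decide what to do when the table is full, which is what most of the configuration options deal with.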
+## Installation
+The setup procedure is completely standard.
+### Using Bundler
+With [Bundler](https://bundler.io/#getting-started) just add the following line to your Gemfile:
+```ruby
+gem 'lzwrb'
+```
+And then install via `bundle install`.
+### Using RubyGems
+With Ruby's package manager, just execute the following command on the terminal:
+```
+gem install lzwrb
+```
+You may need admin privileges (e.g. `sudo` on Linux). Beware of using sudo if you're using a version manager, like [RVM](https://rvm.io/) or [rbenv](https://github.com/rbenv/rbenv).
+## Basic Usage
+Require the gem, create an `LZWrb` object with the desired settings and run the `encode` and `decode` methods.
+The following example uses the default configuration. See the next section for a list of all the available settings, which can be provided to the constructor via keyword arguments.
+```ruby
+require 'lzwrb'
+lzw = LZWrb.new
+data = 'TOBEORNOTTOBEORTOBEORNOT' * 10000
+puts lzw.decode(lzw.encode(data)) == data.b
+```
+Output sample:
+```
+[09:41:19.628] LZW <- Encoding 234.375KiB with 8-16 bit codes, LSB packing, no special codes, binary mode.
+[09:41:19.767] LZW -> Encoding finished in 0.139s (avg. 13.172 mbit/s).
+[09:41:19.767] LZW -> Encoded data: 4.458KiB (98.10% compression).
+[09:41:19.767] LZW <- Decoding 4.458KiB with 8-16 bit codes, LSB packing, no special codes, binary mode.
+[09:41:19.773] LZW -> Decoding finished in 0.006s (avg. 5.431 mbit/s).
+[09:41:19.773] LZW -> Decoded data: 234.375KiB (98.10% compression).
+true
+```
+Note: The output can be reduced, suppressed, or made more detailed; see [Verbosity](#verbosity).
+## Configuration
+Most of the usual options for the LZW algorithm can be configured individually by supplying the appropriate keyword arguments in the constructor. Additionally, there are several presets available, i.e., a selection of setting values with a specific purpose (e.g. best compression, fastest encoding, compatible with GIF...). Examples:
+```ruby
+# Presets are constants defined inside the LZWrb class, hence the namespace
+lzw1 = LZWrb.new(preset: LZWrb::PRESET_GIF)
+lzw2 = LZWrb.new(min_bits: 8, max_bits: 16)
+lzw3 = LZWrb.new(preset: LZWrb::PRESET_FAST, clear: true, stop: true)
+```
+Individual settings may be used together with a preset, in which case the individual setting takes preference over the value that may be set by the preset, thus enabling the fine-tuning of a specific preset.
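The precedence rule above (explicit argument first, then preset, then built-in default) can be sketched as follows. The helper names `first_set` and `resolve`, and the two hashes, are hypothetical stand-ins for illustration; the gem's source calls a helper named `find_arg` with the same argument order, but its body is not shown here, so this is an assumption about how such a lookup might work.

```ruby
# Hypothetical helper: returns the first candidate that is not nil, so an
# explicit `false` still overrides a preset's `true`.
def first_set(*candidates)
  candidates.find { |c| !c.nil? }
end

# Hypothetical resolver: for each known setting, prefer the explicit value,
# then the preset's value, then the default.
def resolve(explicit, preset, defaults)
  defaults.keys.to_h { |k| [k, first_set(explicit[k], preset[k], defaults[k])] }
end

# Illustrative values only (not copied from the gem's presets).
preset   = { min_bits: 8, max_bits: 12, clear: true }
defaults = { min_bits: 8, max_bits: 16, clear: false }

p resolve({ clear: false }, preset, defaults)
# {min_bits: 8, max_bits: 12, clear: false} (older Rubies print :key=>value)
```

Testing for `nil` rather than truthiness matters here: otherwise an explicit `false` could never override a preset that enables an option.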
+The following options are available:
+Argument     | Type    | Default   | Description
+------------ | ------- | --------- | ---
+`:alphabet`  | Array   | `BINARY`  | List of characters that compose the data to encode. See [Custom alphabets](#custom-alphabets).
+`:binary`    | Boolean | True      | Whether to use binary or textual mode. See [Binary vs textual](#binary-vs-textual).
+`:bits`      | Integer | None      | Code length in bits, for constant length mode. See [Code length](#code-length).
+`:clear`     | Boolean | False     | Whether to use Clear codes or not. See [Clear & Stop codes](#clear--stop-codes).
+`:deferred`  | Boolean | False     | Whether to use deferred Clear codes for the decoding process. See [Clear & Stop codes](#clear--stop-codes).
+`:lsb`       | Boolean | True      | Whether to use LSB or MSB (least/most significant bit packing order). See [Packing order](#packing-order).
+`:max_bits`  | Integer | 16        | Maximum code length in bits, for variable length mode. See [Code length](#code-length).
+`:min_bits`  | Integer | 8         | Minimum code length in bits, for variable length mode. See [Code length](#code-length).
+`:preset`    | Hash    | None      | Specifies which preset (set of options) to use. See [Presets](#presets).
+`:safe`      | Boolean | False     | Perform first pass during encoding process for data integrity verification. See [Custom alphabets](#custom-alphabets).
+`:stop`      | Boolean | False     | Whether to use Stop codes or not. See [Clear & Stop codes](#clear--stop-codes).
+`:verbosity` | Symbol  | `:normal` | Specifies the amount of detail to print to the console. See [Verbosity](#verbosity).
+### Presets
+Presets specify the value for several configuration options at the same time, with a specific goal in mind (typically, to optimize a certain aspect, or to be compatible with certain format). They can be used together with individual settings, in which case the latter take preference.
+The following presets are currently available:
+Preset | Description
+--- | ---
+`PRESET_BEST` | Aimed at best compression, uses variable code length (8-16 bits), no special codes and LSB packing.
+`PRESET_FAST` | Aimed at fastest encoding, uses a constant code length of 16 bits, no special codes and LSB packing.
+`PRESET_GIF`  | This is the exact specification implemented by the GIF format, uses variable code length (8-12 bits), clear and stop codes, and LSB packing.
+Their descriptive names are based on the average case. However, on unusual samples it is possible for `PRESET_BEST` to actually compress worse than many other settings.
+For example, on highly random samples, most patterns are very short, perhaps only 1 or 2 characters long, and thus substituting them with a long code can end up being counterproductive. In these cases, a smaller code length might be preferable. In fact, on this kind of data LZW encoding may end up increasing the size, and thus not being suitable at all.
+### Binary vs textual
+The encoder and decoder may be run in both `binary` (default) or `textual` mode. Binary mode encodes the bytes of the string, whereas textual mode encodes its characters. Using binary mode with the (default) `BINARY` alphabet (of length 256) suffices to encode any arbitrary input, whereas with textual mode, one has to ensure the provided alphabet includes all the characters used in the input (this could prove tricky for arbitrary Unicode text). On the plus side, textual mode could attain higher compression for Unicode text, given that each character will be assigned a single code.
+If in doubt, it is recommended to leave the default values for these settings (binary mode and binary alphabet), since they will work out of the box for any input. Note that binary mode is *always* the default, **even** when one of the custom text alphabets is used. Thus, to enforce textual mode, set `binary: false` explicitly and also specify a suitable alphabet (see [Custom alphabets](#custom-alphabets)).
+The result of the encoding process is always returned as a binary string (i.e., `ASCII-8BIT`-encoded), whereas the result of the decoding process is either a binary string or a Unicode string (i.e., `UTF-8`-encoded), depending on which mode was selected - binary or textual - respectively.
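The difference between the two modes comes down to which view of the string is iterated over. The following sketch uses only core Ruby string methods; the helper name `string_views` is hypothetical and is not part of the gem.

```ruby
# Hypothetical helper showing the two views of a string that the two modes
# would operate on.
def string_views(text)
  {
    chars:  text.chars,  # textual mode: one symbol per character
    bytes:  text.bytes,  # binary mode: one symbol per byte
    binary: text.b       # copy of the string tagged as ASCII-8BIT
  }
end

text  = "a\u00F1o"       # "año": 3 characters, 4 bytes in UTF-8
views = string_views(text)

puts views[:chars].size                                        # 3
puts views[:bytes].size                                        # 4
puts views[:binary].encoding == Encoding::ASCII_8BIT           # true
puts views[:binary] == text                                    # false
puts views[:binary].dup.force_encoding(Encoding::UTF_8) == text # true
```

The fourth line shows why a binary result should be compared against the `.b` copy of the original (or re-tagged as UTF-8 first): for non-ASCII text, strings with the same bytes but different encodings do not compare equal.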
+### Code length
+The different code length parameters specify how many bits are used when packing each of the codes generated by the LZW algorithm into the final output string during the encoding process.
+The encoder and the decoder need to agree on these values beforehand, otherwise, the decoder will just see garbled binary data when trying to read an encoded stream. This does not pose a problem when using this gem, since the configuration of the `LZWrb` object can only be done during initialization, so as long as both encoding and decoding are done with the same object, the configuration will match.
+Classically, two different modes are common, *constant length* and *variable length*.
+- Constant length mode uses the same bit size for all generated codes, and can be configured by setting the `:bits` parameter to whatever positive integer is desired. It has the disadvantage that there are either too few codes available (if the length is small), leading to poor compression and performance (due to the table needing to be refreshed often); or if the length is big to prevent these issues, then all codes take up substantial storage, again potentially leading to poor compression ratios.
+- Variable length mode was introduced to mitigate these issues. It will start with a smaller code size, resulting in more efficiently packed codes, and it will progressively increase the length as required when the table gets full, until eventually reaching a maximum size that will trigger a table refresh, as is the case for the previous mode. To configure this mode, the parameters `:min_bits` and `:max_bits` are provided.
+If parameters for both modes are provided, then the constant `:bits` will take priority. Variable length mode typically provides a better compression ratio for most data samples, though the specific optimal values depend on the nature of the data, and some experimentation may be needed to find them. On the other hand, a constant length of 8, 16, 32 or 64 bits provides the fastest speed, since all codes are then aligned with byte boundaries and code packing can be performed with native byte operations.
+The LZW algorithm requires that the code table be initialized with all 1 character strings first, i.e., the [alphabet](#custom-alphabets) must be included in the table. Therefore, the maximum code length must be big enough to hold the alphabet, and it is also irrelevant if the minimum code size is chosen to be smaller than the alphabet's size, as the first codes will nevertheless take up as many bits as required. This will all be corrected by the gem if the user inputs incorrect code lengths.
+It is recommended that the maximum code length be at least several bits higher than the number of bits required by the alphabet. For example, the default alphabet (binary) requires 8 bits (so that all 256 possible byte values can be encoded), and thus the maximum length should be several bits higher than 8. Otherwise, there will be a significant compression penalty, as well as a performance penalty, because the table needs to be refreshed very often due to the lack of slots. It is also pointless in this case to set the minimum code length below 8 bits.
+The presets provide good ball-park figures for reasonable code lengths, ranging from 8 to 16 bits for arbitrary binary data. If a smaller alphabet is selected, say, a 5 bit one (32 entries, enough to fit all upper-case latin letters, for instance), then smaller lengths could be reasonable as well.
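As a rough illustration of the arithmetic in this section, the smallest usable code width for a given number of table entries can be computed with `Integer#bit_length`. The helper names `bits_for` and `capacity` are hypothetical, and the gem's exact correction rule is not shown in this document, so treat this as a sketch of the reasoning only.

```ruby
# Smallest code width (in bits) able to index `entries` distinct table slots.
def bits_for(entries)
  [(entries - 1).bit_length, 1].max
end

# Number of distinct codes available at a given code width.
def capacity(bits)
  1 << bits
end

puts bits_for(256)      # 8: default binary alphabet
puts bits_for(26)       # 5: upper-case Latin letters
puts bits_for(26 + 1)   # 5: same alphabet plus one special code
puts bits_for(256 + 2)  # 9: binary alphabet plus Clear and Stop codes
puts capacity(12)       # 4096
puts capacity(16)       # 65536
```

With an 8-bit alphabet and a 9-bit maximum, only 256 slots would remain for learned patterns, which is why a maximum several bits above the alphabet's width is suggested.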
+### Packing order
+As explained, the arbitrary code lengths imply that the resulting codes need to be aligned with byte boundaries, and might take multiple bytes. It is therefore important to select how to pack the corresponding bits of the codes into the resulting output string. Two standard schemes are common:
+- *Least significant bit* (LSB) packing order will align the least significant bit of the code with the least significant bit available in the last byte of the output, leaving the rest, most significant bits, for the subsequent bytes.
+- *Most significant bit* (MSB) packing order will align the most significant bit of the code with the most significant bit available in the last byte of the output, leaving the rest, least significant bits, for subsequent bytes.
+Currently, only LSB packing order is implemented, and thus, the `:lsb` configuration parameter is, for now, a placeholder, with the feature being planned.
+Both packing orders are commonplace, with LSB being used in formats like GIF and utilities like UNIX compress, and MSB being used in formats like PDF and TIFF, for instance.
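LSB packing can be sketched with a small bit accumulator. This is a simplified illustration with a fixed code width and hypothetical method names (`pack_lsb`, `unpack_lsb`); it is not taken from the gem, which also has to handle code widths that change during encoding.

```ruby
# Packs integer codes of a fixed bit width into a binary string, LSB first.
def pack_lsb(codes, width)
  acc = 0      # bit accumulator; new codes are placed above pending bits
  nbits = 0    # number of pending bits in the accumulator
  out = ''.b
  codes.each do |code|
    acc |= code << nbits
    nbits += width
    while nbits >= 8
      out << (acc & 0xFF)  # emit the 8 lowest bits as one byte
      acc >>= 8
      nbits -= 8
    end
  end
  out << (acc & 0xFF) if nbits > 0  # last partial byte, padded with zeros
  out
end

# Reads `count` codes of a fixed bit width back from an LSB-packed string.
def unpack_lsb(bytes, width, count)
  acc = 0
  nbits = 0
  codes = []
  mask = (1 << width) - 1
  bytes.each_byte do |b|
    acc |= b << nbits
    nbits += 8
    while nbits >= width && codes.size < count
      codes << (acc & mask)
      acc >>= width
      nbits -= width
    end
  end
  codes
end

packed = pack_lsb([1, 2, 3], 9)
p packed.bytes              # [1, 4, 12, 0]
p unpack_lsb(packed, 9, 3)  # [1, 2, 3]
```

Note that `unpack_lsb` needs to be told how many codes to read: the zero padding in the last byte is otherwise indistinguishable from further codes of value 0 when the width is small, which is the ambiguity that Stop codes resolve.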
+### Clear & Stop codes
+Another common feature of the LZW algorithm is the addition of special codes, not corresponding to any specific pattern of the data but, instead, to certain control instructions for the decoder.
+- The **CLEAR** code specifies to the decoder that the code table should be refreshed immediately. This feature is not technically needed, since the encoder and decoder must agree on all parameters anyways, and therefore are "synchronized", which means they will both refresh the table at the exact same time (whenever it's needed due to the table becoming full).
+  However, this feature allows an intelligent encoder (which this one is *not*) to detect when the current table no longer properly reflects the structure of the data, and would benefit from refreshing the table and getting rid of the old codes, thus also reducing the code size (if variable length mode is used) and boosting the performance (codes are found sooner on smaller tables). This encoding feature could be particularly useful for data with significant pattern changes.
+  The encoder/decoder can be configured to use clear codes with the `:clear` option. Since the encoder is not intelligent, this option simply means that the encoder will output a clear code before refreshing the table when it's full, thus adding a little overhead. It is therefore only useful for compatibility with formats that require the usage of clear codes, like GIF.
+- The **STOP** code indicates the end of the data, and that the algorithm should therefore terminate. It is only necessary in two scenarios:
+  - If the code length is smaller than 8 bits. In this case, a long string of 0's at the end of the final byte is ambiguous, as it could either be just padding, or actually represent code 0. Therefore, the encoder will force stop codes when the minimum code size is below 8 bits.
+  - For compatibility with formats that require stop codes, like GIF.
+  The encoder/decoder can be configured to use stop codes with the `:stop` option. As mentioned, stop codes might be used even if this option is not set, or is set to false, if the encoder deems it necessary (a warning will be issued in those cases).
+Another relevant option is **deferred clear codes**. An intelligent encoder can choose not to refresh the code table even when it's full, thus relying on the current one, without adding any new codes, until it deems it useful, at which point it will output a clear code as a signal for the decoder. A decoder would then have to be configured accordingly, such that it doesn't refresh the code table automatically - which is the standard behaviour - unless it receives an explicit clear code.
+This feature can be set using the `:deferred` option during initialization. It will only affect the decoder, since the encoder is not intelligent and will therefore never choose to output clear codes prematurely nor in deferred fashion, but rather, precisely when the code table is full. Nevertheless, it is necessary if one wants to decode data that was encoded with this feature on.
+Some formats, like GIF, employ this feature. In fact, it caused a great deal of confusion with developers, so much so that the cover of the GIF89 specification deals with this explanation. To this day, many pieces of software cannot properly decode GIF files that use deferred clear codes, since their decoder automatically refreshes the code table whenever it's full.
+### Custom alphabets
+As mentioned before, this algorithm can work for arbitrary data composed of characters from any character set, or *alphabet*. By default, this gem will use a binary alphabet of size 256 containing all possible byte values, which, therefore, is suitable to encode and decode any arbitrary piece of data.
+It is the recommended alphabet to use if in doubt or if binary data is meant to be encoded. Even if the data to be encoded is just text, if it is Unicode it is probably still best to use the default binary alphabet.
+Nevertheless, if data composed of a substantially smaller set of symbols is meant to be encoded, the full alphabet might be overkill, and a smaller alphabet could be better suited, leading to better compression due to the potential to use smaller code lengths. Unicode text can also benefit from a custom alphabet which includes all characters used in it, provided textual mode is used, since in that case, each character will be assigned a single compression code, rather than up to 4.
+This can be configured with the `:alphabet` option of the constructor, which receives an array with all the characters the data is (supposedly) composed of.
+For example, if the data to be encoded is composed only of bytes 0 through 7, the array
+```ruby
+["\x00", "\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07"]
+```
+could be used as the alphabet, and a minimum code length of only 3 bits could be set. The gem comes equipped with a few default alphabets as constants, for instance, `HEX_UPPER` contains all 16 possible hex digits in uppercase, and the default alphabet, `BINARY`, contains all possible 256 byte values. See the `LZWrb` class for a full list. Nevertheless, an arbitrary array can be used here, as long as it's composed only of 1-character strings.
+Sample input:
+```ruby
+lzw = LZWrb.new(alphabet: LZWrb::LATIN_UPPER)
+data = 'TOBEORNOTTOBEORTOBEORNOT' * 10000
+puts lzw.decode(lzw.encode(data)) == data
+```
+Output:
+```
+[09:48:25.765] LZW <- Encoding 234.375KiB with 5-16 bit codes, LSB packing, STOP codes, textual mode.
+[09:48:25.899] LZW -> Encoding finished in 0.133s (avg. 13.717 mbit/s).
+[09:48:25.899] LZW -> Encoded data: 4.330KiB (98.15% compression).
+[09:48:25.899] LZW <- Decoding 4.330KiB with 5-16 bit codes, LSB packing, STOP codes, textual mode.
+[09:48:25.909] LZW -> Decoding finished in 0.010s (avg. 3.272 mbit/s).
+[09:48:25.909] LZW -> Decoded data: 234.375KiB (98.15% compression).
+```
+Note how the encoder automatically selected a minimum code size of 5 bits, which suffices to hold the specified alphabet. Specifying fewer minimum bits doesn't make sense, and will be silently corrected by the encoder as well.
+Note that if the provided alphabet does not contain all the symbols that compose the data to be encoded, this will result in unexpected and incorrect behaviour; most likely an exception, although in some rare cases it could just result in garbled data. Therefore, the user should do either of the following:
+- Ensure the data is composed solely of characters from the alphabet. This will never be a problem if the default alphabet (binary) is used, since that can be used to encode arbitrary data. As mentioned, this is the recommended option unless the data can clearly benefit from a smaller alphabet.
+- Use the `:safe` option in the constructor, which will always perform a first pass before encoding to ensure that the provided data is indeed composed only of characters from the alphabet. This option, naturally, incurs a performance penalty for the encoding process (see [Benchmarks](#benchmarks)) that depends on the length of the data. Thus, whenever speed is crucial, the first option is preferable.
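For the first option, a check along these lines could be run before encoding. It is a sketch of what a verification pass might look like, not the gem's `:safe` implementation (which is not shown in this document); the method name `missing_symbols` and its `binary:` parameter are hypothetical.

```ruby
require 'set'

# Returns the distinct symbols of `data` that are absent from `alphabet`.
# With binary: true the data is inspected byte by byte, otherwise character
# by character.
def missing_symbols(data, alphabet, binary: true)
  known = alphabet.to_set
  symbols = binary ? data.b.each_char : data.each_char
  symbols.reject { |s| known.include?(s) }.uniq
end

latin_upper = ('A'..'Z').to_a

p missing_symbols('TOBEORNOT', latin_upper)  # []
p missing_symbols('TO BE', latin_upper)      # [" "]
```

An empty result means the data can be expressed with the alphabet. The set lookup keeps the pass linear in the length of the data, but it is still an extra pass, consistent with the penalty mentioned above.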
+### Verbosity
+The amount of information logged to the terminal by the encoder and decoder can be set using the `:verbosity` option of the constructor, which takes any of the following symbols:
+Value | Description
+--- | ---
+`:silent` | Don't log anything.
+`:minimal` | Log only errors.
+`:quiet` | Log errors and warnings.
+`:normal` | Also log brief encoding/decoding info (time, compression ratio, speed...).
+`:debug` | Also log additional debug information of the encoding / decoding process, such as when the code table gets refreshed, or the traces of the exceptions, if any.
+The default value is `:normal`.
+## Benchmarks
+On mid-range computers (2nd gen Ryzen 5, 7th gen i5) I've obtained the following average speeds:
+- A low bound for encoding speeds of 5 mbit/s when the data is random, and thus, very hard to encode.
+- A high bound for encoding speeds of 16 mbit/s when the data is very regular, and thus, easy to encode.
+- A fairly consistent decoding speed of 8-9 mbit/s, regardless of the nature of the original data.
+The optional (and by default disabled) `:safe` setting, which performs a first pass in the encoding process to ensure data integrity (see [Custom alphabets](#custom-alphabets)), seems to incur a performance penalty of roughly 15%, i.e., encoding bitrate is decreased by about 15%. This setting does not affect the decoding process.
+## Todo
+Eventually I'd like to have the following extra features tested and implemented:
+### Main features
+* Most significant bit (MSB) packing order.
+* "Early change" support.
+* Support for TIFF and PDF (requires the 2 features above).
+* Support for UNIX compress (has a bug).
+### Smaller additions
+* Option to disable automatic table reinitialization.
+* Choice between plain and rich logs.
+### Optimizations
+* Native packing for codes of constant width of 8, 16, 32 or 64 bits for extra speed.
+* Try using a Trie instead of a Hash for the encoding process.
+## Notes
+* **Concurrency**: The `LZWrb` objects are _not_ thread safe. If you want to encode/decode in parallel, you must create a separate object for each thread, even if they use the same exact configuration.
+* **Custom alphabets**: If you change the default alphabet (binary), ensure the data to be encoded can be expressed solely with that alphabet, or alternatively, specify the `safe` option to the encoder.

data/lib/lzwrb.rb CHANGED

@@ -1,19 +1,41 @@
+# Holds an LZW encoder/decoder with a specific configuration. These objects are not thread safe, so for
+# concurrent use different ones should be created.
 class LZWrb
-  # Default alphabets
+  # Alphabet containing the digits 0 to 9
   DEC         = (0...10).to_a.map(&:chr)
+  # Alphabet containing the hex digits 0 to F in uppercase
   HEX_UPPER   = (0...16).to_a.map{ |n| n.to_s(16).upcase }
+  # Alphabet containing the hex digits 0 to f in lower case
   HEX_LOWER   = (0...16).to_a.map{ |n| n.to_s(16).downcase }
+  # Alphabet containing the 26 letters of the latin alphabet in upper case
   LATIN_UPPER = ('A'..'Z').to_a
+  # Alphabet containing the 26 letters of the latin alphabet in lower case
   LATIN_LOWER = ('a'..'z').to_a
+  # Alphabet containing the alphanumeric characters in upper case (A-Z, 0-9)
   ALPHA_UPPER = LATIN_UPPER + DEC
+  # Alphabet containing the alphanumeric characters in lower case (a-z, 0-9)
   ALPHA_LOWER = LATIN_LOWER + DEC
+  # Alphabet containing the alphanumeric characters (A-Z, a-z, 0-9)
   ALPHA       = LATIN_UPPER + LATIN_LOWER + DEC
+  # Alphabet containing all printable ASCII characters (ASCII 32 - 126)
   PRINTABLE   = (32...127).to_a.map(&:chr)
+  # Alphabet containing all ASCII characters
   ASCII       = (0...128).to_a.map(&:chr)
+  # Alphabet containing all possible byte values, suitable for any input data
   BINARY      = (0...256).to_a.map(&:chr)
-  # Default presets
+  # Preset to satisfy the GIF specification
   PRESET_GIF = {
     min_bits: 8,
     max_bits: 12,
@@ -22,6 +44,8 @@ class LZWrb
     stop:     true,
     deferred: true
   }
+  # Preset optimized for speed
   PRESET_FAST = {
     min_bits: 16,
     max_bits: 16,
@@ -29,6 +53,8 @@ class LZWrb
     clear:    false,
     stop:     false
   }
+  # Preset optimized for compression
   PRESET_BEST = {
     min_bits: 8,
     max_bits: 16,
@@ -46,27 +72,51 @@ class LZWrb
     debug:   4  # Print everything, including debug details about the encoding process
   }
-  # Class default values (no NIL's here!)
-  @@min_bits = 8     # Minimum code bit length
-  @@max_bits = 16    # Maximum code bit length before rebuilding table
-  @@lsb      = true  # Least significant bit first order
-  @@clear    = false # Use CLEAR codes
-  @@stop     = false # Use STOP codes
-  @@deferred = false # Use deferred CLEAR codes
+  # Default value for minimum code bits (may be changed depending on alphabet)
+  DEFAULT_MIN_BITS = 8
+  # Default value for maximum code bits before rebuilding code table (may be changed depending on alphabet)
+  DEFAULT_MAX_BITS = 16
+  # Use Least Significant Bit packing order
+  DEFAULT_LSB      = true
+  # Use explicit Clear codes to indicate the initialization of the code table
+  DEFAULT_CLEAR    = false
+  # Use explicit Stop codes to indicate the end of the data
+  DEFAULT_STOP     = false
+  # Enable the use of deferred Clear codes (the decoder won't rebuild the table, even when full, unless an explicit Clear code is received)
+  DEFAULT_DEFERRED = false
+  # Creates a new encoder/decoder object with the given settings.
+  # @param alphabet [Array<String>] Set of characters that compose the messages to encode.
+  # @param binary [Boolean] Use binary encoding or textual encoding.
+  # @param bits [Integer] Code bit size for constant length encoding (supersedes min/max bit size).
+  # @param clear [Boolean] Use clear codes every time the table gets reinitialized.
+  # @param deferred [Boolean] Support deferred clear codes when decoding (i.e., don't refresh code table unless an explicit clear code is received, even when it's full).
+  # @param lsb [Boolean] Use least or most significant bit packing (currently useless, only LSB supported).
+  # @param max_bits [Integer] Maximum code bit size for variable length encoding (superseded by `bits`).
+  # @param min_bits [Integer] Minimum code bit size for variable length encoding (superseded by `bits`).
+  # @param preset [Hash] Predefined configurations for a few settings (such as bit count or usage of clear/stop codes)
+  # @param safe [Boolean] First encoding pass to verify alphabet covers all data
+  # @param stop [Boolean] Use stop codes to denote the end of the encoded data.
+  # @param verbosity [Symbol] Verbosity level of the encoder (see {LZWrb::VERBOSITY}).
+  # @return [LZWrb] The newly created encoder/decoder.
   def initialize(
-      preset:    nil,     # Predefined configurations (GIF...)
-      bits:      nil,     # Code bit size for constant length encoding (superseeds min/max bit size)
-      min_bits:  nil,     # Minimum code bit size for variable length encoding (superseeded by 'bits')
-      max_bits:  nil,     # Maximum code bit size for variable length encoding (superseeded by 'bits')
-      binary:    nil,     # Use binary encoding (vs regular text encoding)
-      alphabet:  BINARY,  # Set of characters that compose the messages to encode
-      safe:      false,   # First encoding pass to verify alphabet covers all data
-      lsb:       nil,     # Use least or most significant bit packing
-      clear:     nil,     # Use clear codes every time the table gets reinitialized
-      stop:      nil,     # Use stop codes at the end of the encoding
-      deferred:  nil,     # Use deferred clear codes
-      verbosity: :normal  # Verbosity level of the encoder
+      preset:    nil,
+      bits:      nil,
+      min_bits:  nil,
+      max_bits:  nil,
+      binary:    nil,
+      alphabet:  BINARY,
+      safe:      false,
+      lsb:       nil,
+      clear:     nil,
+      stop:      nil,
+      deferred:  nil,
+      verbosity: :normal
     )
     # Parse preset
     params = preset || {}
@@ -88,7 +138,8 @@ class LZWrb
     warn('Removed duplicate entries from alphabet') if @alphabet.size < alphabet.size
     # Binary compression
-    @binary = binary.nil? ? alphabet == BINARY : binary
+    @binary = binary == false ? false : true
+    warn("Binary alphabet being used with textual mode, are you sure this is what you want?") if !@binary && @alphabet == BINARY
     # Safe mode for encoding (verifies that the data provided is composed exclusively
     # by characters from the alphabet)
@@ -104,8 +155,8 @@ class LZWrb
         @max_bits = bits
       end
     else
-      @min_bits = find_arg(min_bits, params[:min_bits], @@min_bits)
-      @max_bits = find_arg(max_bits, params[:max_bits], @@max_bits)
+      @min_bits = find_arg(min_bits, params[:min_bits], DEFAULT_MIN_BITS)
+      @max_bits = find_arg(max_bits, params[:max_bits], DEFAULT_MAX_BITS)
       if @max_bits < @min_bits
         warn("Max code size (#{@max_bits}) should be higher than min code size (#{@min_bits}): changed max code size to #{@min_bits}.")
         @max_bits = @min_bits
@@ -119,8 +170,8 @@ class LZWrb
     end
     # Clear and stop codes
-    use_clear = find_arg(clear, params[:clear], @@clear)
-    use_stop = find_arg(stop, params[:stop], @@stop)
+    use_clear = find_arg(clear, params[:clear], DEFAULT_CLEAR)
+    use_stop = find_arg(stop, params[:stop], DEFAULT_STOP)
     if !use_stop && @min_bits < 8
       use_stop = true
       # Warning if stop codes were explicitly disabled (false, NOT nil)
@@ -150,12 +201,15 @@ class LZWrb
     idx = @alphabet.size - 1
     @clear = use_clear ? idx += 1 : nil
     @stop = use_stop ? idx += 1 : nil
-    @deferred = find_arg(deferred, params[:deferred], @@deferred)
+    @deferred = find_arg(deferred, params[:deferred], DEFAULT_DEFERRED)
     # Least/most significant bit packing order
-    @lsb = find_arg(lsb, params[:lsb], @@lsb)
+    @lsb = find_arg(lsb, params[:lsb], DEFAULT_LSB)
   end
+  # Encode the provided data.
+  # @param data [String] Data to encode.
+  # @return [String] Encoded data.
   def encode(data)
     # Log
     log("<- Encoding #{format_size(data.bytesize)} with #{format_params}.")
@@ -169,7 +223,9 @@ class LZWrb
     # LZW-encode data
     buf = ''
     put_code(@clear) if !@clear.nil?
-    data.each_char do |c|
+    proc = -> (c) {
+      c = c.chr if @binary
+      @count += 1
       next_buf = buf + c
       if table_has(next_buf)
         buf = next_buf
@@ -179,7 +235,8 @@ class LZWrb
         table_check()
         buf = c
       end
-    end
+    }
+    @binary ? data.each_byte(&proc) : data.each_char(&proc)
     put_code(@table[buf])
     put_code(@stop) if !@stop.nil?
@@ -195,7 +252,9 @@ class LZWrb
     lex(e, 'Encoding error', true)
   end
-  # Optimization? Unpack bits subsequently, rather than converting between strings and ints
+  # Decode the provided data.
+  # @param data [String] Data to decode.
+  # @return [String] Decoded data.
   def decode(data)
     # Log
     log("<- Decoding #{format_size(data.bytesize)} with #{format_params}.")
@@ -214,6 +273,7 @@ class LZWrb
     width = @bits
     while off + width <= len
       # Parse code
+      @count += 1
       code = bits[off ... off + width].reverse.to_i(2)
       off += width
@@ -246,7 +306,7 @@ class LZWrb
     ttime = Time.now - stime
     log("-> Decoding finished in #{"%.3fs" % [ttime]} (avg. #{"%.3f" % [(8.0 * data.bytesize / 1024 ** 2) / ttime]} mbit\/s).")
     log("-> Decoded data: #{format_size(out.bytesize)} (#{"%5.2f%%" % [100 * (1 - data.bytesize.to_f / out.bytesize)]} compression).")
-    out
+    @binary ? out.force_encoding('ASCII-8BIT') : out.force_encoding('UTF-8')
   rescue => e
     lex(e, 'Decoding error', false)
   end
@@ -256,10 +316,11 @@ class LZWrb
   # Initialize buffers, needs to be called every time we execute a new
   # compression / decompression job
   def init(compress)
-    @buffer = []              # Contains result of compression
-    @boff = 0                 # BIT offset of last buffer byte, for packing
-    @compress = compress      # Compression or decompression job
-    @step = @compress ? 0 : 1 # Decoder is always 1 step behind the encoder
+    @count    = 0                 # Amount of codes generated
+    @buffer   = []                # Contains result of compression
+    @boff     = 0                 # BIT offset of last buffer byte, for packing
+    @compress = compress          # Compression or decompression job
+    @step     = @compress ? 0 : 1 # Decoder is always 1 step behind the encoder
   end
   # < --------------------------- PARSING METHODS ---------------------------- >
@@ -326,6 +387,7 @@ class LZWrb
     end
     @bits = [@key.bit_length, @min_bits].max
+    dbg("Refreshed code table after #{@count} #{@compress ? 'characters' : 'codes'}.")
   end
   def table_has(val)
@@ -360,8 +422,13 @@ class LZWrb
   # < ------------------------- ENCODING METHODS --------------------------- >
   def verify_data(data)
-    alph = @alphabet.each_with_index.to_h
-    raise "Data contains characters not present in the alphabet" if data.each_char.any?{ |c| !alph.include?(c) }
+    data.send(@binary ? :bytes : :chars) do |c|
+      c = c.chr if @binary
+      if !@alphabet.include?(c)
+        c = c.force_encoding('UTF-8').scrub{ |b| '\x' + b.unpack('H*')[0].upcase } if @binary
+        raise "Data contains character not present in the alphabet: #{c}"
+      end
+    end
   end
   def put_code(code)
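
Note on the hunks above: in binary mode, 0.2.1 iterates over bytes (`each_byte` plus `chr`) instead of characters and tags decoded output as `ASCII-8BIT`, which matches the changelog's fix for Unicode strings in binary mode. The following is a minimal standalone sketch of the underlying Ruby string behaviour. It does not call the gem, and all variable names are illustrative assumptions rather than anything from the lzwrb source.

```ruby
# Standalone sketch of the Ruby behaviour the hunks rely on (no lzwrb required).
text = "héllo" # 5 characters, but 6 bytes in UTF-8

# Textual mode walks characters; binary mode walks bytes.
chars = text.each_char.to_a
bytes = text.each_byte.map(&:chr) # Integer#chr returns a one-byte string

# Joined bytes form a binary string; it only compares equal to the
# original text once it is re-tagged as UTF-8.
binary   = bytes.join.force_encoding('ASCII-8BIT')
restored = binary.dup.force_encoding('UTF-8')

# Rendering a byte that is invalid UTF-8, using the same scrub block
# as the new verify_data error message.
bad   = 255.chr # "\xFF", encoding ASCII-8BIT
shown = bad.force_encoding('UTF-8').scrub { |b| '\x' + b.unpack('H*')[0].upcase }

puts chars.size, bytes.size, restored == text, shown
```

Under these assumptions, a multi-byte character is a single symbol in textual mode but several symbols in binary mode, so the alphabet has to cover bytes rather than characters when `binary` is enabled.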

metadata CHANGED

@@ -1,20 +1,20 @@
 --- !ruby/object:Gem::Specification
 name: lzwrb
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.1
 platform: ruby
 authors:
 - edelkas
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-02-02 00:00:00.000000000 Z
+date: 2024-02-03 00:00:00.000000000 Z
 dependencies: []
 description: |2
       This library provides LZW encoding and decoding capabilities with no
-      dependencies and reasonably fast speed. It is highly configurable,
+      dependencies and a reasonably fast speed. It is highly configurable,
       supporting both constant and variable code lengths, custom alphabets,
-      usage of clear/stop codes...
+      usage of clear/stop codes... It uses LSB packing order.
       It is compatible with the GIF specification, and comes equipped with
       several presets. Eventually I'd like to add compatibility with other
@@ -24,11 +24,17 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- ".yardopts"
+- CHANGELOG.md
+- README.md
 - lib/lzwrb.rb
 homepage: https://github.com/edelkas/lzwrb
 licenses: []
 metadata:
+  homepage_uri: https://github.com/edelkas/lzwrb
   source_code_uri: https://github.com/edelkas/lzwrb
+  documentation_uri: https://www.rubydoc.info/gems/lzwrb
+  changelog_uri: https://www.rubydoc.info/gems/lzwrb/file/CHANGELOG.md
 post_install_message:
 rdoc_options: []
 require_paths:
@@ -44,8 +50,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.6
+rubygems_version: 3.1.2
 signing_key:
 specification_version: 4
-summary: Pury Ruby LZW encoder/decoder with a wide range of settings
+summary: Pure Ruby LZW encoder/decoder, highly configurable, compatible with GIF
 test_files: []