RubyGems - philiprehberger-encoding_kit - Versions diffs - 0.1.0 - Mend

philiprehberger-encoding_kit 0.1.0

Files changed (9)

checksums.yaml +7 -0
data/CHANGELOG.md +18 -0
data/LICENSE +21 -0
data/README.md +118 -0
data/lib/philiprehberger/encoding_kit/converter.rb +58 -0
data/lib/philiprehberger/encoding_kit/detector.rb +81 -0
data/lib/philiprehberger/encoding_kit/version.rb +7 -0
data/lib/philiprehberger/encoding_kit.rb +88 -0
metadata +56 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 7b463d6499d157c675b38d8fa5c49a7c9057735e23bd3e253970d2f4af8caab7
+  data.tar.gz: e341c544cdce4088ae8ac9066f5ca5a48b552c66a61246ca5b6fa65e753e3bb6
+SHA512:
+  metadata.gz: 7b3e7bbbd6b36fb4526e759d394df560d79364bfb19b71ee2b317c583e007f7385ee1e417a41188557899de2b52750d0366706a34e014662fc5a2536360aa849
+  data.tar.gz: a737e7967708482920174711f2339c5b14f608dbd87284e983608ddfe49dea9195b6fb21b570c388ebbe4f1c19e9e6409e3b34c5d3249a3a2b5661b30eca8e94
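
The checksums above are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of how such digests could be checked with Ruby's standard `digest` library follows. The helper name `digest_matches?` and the sample bytes are hypothetical; real use would read `metadata.gz` or `data.tar.gz` from the unpacked gem instead.

```ruby
require "digest"

# Hypothetical helper: compare the SHA-256 and SHA-512 hex digests of some
# bytes against expected values such as those listed in checksums.yaml.
def digest_matches?(bytes, sha256:, sha512:)
  Digest::SHA256.hexdigest(bytes) == sha256 &&
    Digest::SHA512.hexdigest(bytes) == sha512
end

# Stand-in data; in practice this would be File.binread("data.tar.gz").
sample = "example archive bytes".b
expected = {
  sha256: Digest::SHA256.hexdigest(sample),
  sha512: Digest::SHA512.hexdigest(sample)
}

puts digest_matches?(sample, **expected)       # unmodified bytes
puts digest_matches?(sample + "x", **expected) # tampered bytes
```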

data/CHANGELOG.md ADDED

@@ -0,0 +1,18 @@
+# Changelog
+All notable changes to this gem will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+## [0.1.0] - 2026-03-26
+### Added
+- Initial release
+- Encoding detection via BOM inspection and byte-pattern heuristics
+- Conversion between encodings with fallback options
+- UTF-8 normalization with replacement character support
+- BOM detection and stripping
+- Encoding validity checks

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 philiprehberger
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,118 @@
+# philiprehberger-encoding_kit
+[![Tests](https://github.com/philiprehberger/rb-encoding-kit/actions/workflows/ci.yml/badge.svg)](https://github.com/philiprehberger/rb-encoding-kit/actions/workflows/ci.yml)
+[![Gem Version](https://badge.fury.io/rb/philiprehberger-encoding_kit.svg)](https://rubygems.org/gems/philiprehberger-encoding_kit)
+[![License](https://img.shields.io/github/license/philiprehberger/rb-encoding-kit)](LICENSE)
+[![Sponsor](https://img.shields.io/badge/sponsor-GitHub%20Sponsors-ec6cb9)](https://github.com/sponsors/philiprehberger)
+Character encoding detection, conversion, and normalization
+## Requirements
+- Ruby >= 3.1
+## Installation
+Add to your Gemfile:
+```ruby
+gem "philiprehberger-encoding_kit"
+```
+Or install directly:
+```bash
+gem install philiprehberger-encoding_kit
+```
+## Usage
+```ruby
+require "philiprehberger/encoding_kit"
+encoding = Philiprehberger::EncodingKit.detect(raw_bytes)
+utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)
+```
+### Encoding Detection
+```ruby
+require "philiprehberger/encoding_kit"
+# Detects via BOM first, then UTF-8 validity, ASCII, Latin-1 heuristics
+Philiprehberger::EncodingKit.detect("\xEF\xBB\xBFhello".b) # => Encoding::UTF_8
+Philiprehberger::EncodingKit.detect("caf\xC3\xA9".b)       # => Encoding::UTF_8
+Philiprehberger::EncodingKit.detect("caf\xE9".b)            # => Encoding::ISO_8859_1
+```
+### Convert to UTF-8
+```ruby
+require "philiprehberger/encoding_kit"
+# Auto-detect source encoding
+utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)
+# Specify source encoding
+utf8 = Philiprehberger::EncodingKit.to_utf8(latin1_string, from: Encoding::ISO_8859_1)
+```
+### Normalize
+```ruby
+require "philiprehberger/encoding_kit"
+# Replace invalid/undefined bytes with U+FFFD
+clean = Philiprehberger::EncodingKit.normalize("hello \xFF world".b)
+```
+### Convert Between Encodings
+```ruby
+require "philiprehberger/encoding_kit"
+latin1 = Philiprehberger::EncodingKit.convert(utf8_string, from: Encoding::UTF_8, to: Encoding::ISO_8859_1)
+```
+### BOM Handling
+```ruby
+require "philiprehberger/encoding_kit"
+Philiprehberger::EncodingKit.bom?("\xEF\xBB\xBFhello")       # => true
+Philiprehberger::EncodingKit.strip_bom("\xEF\xBB\xBFhello")  # => "hello"
+```
+### Validity Check
+```ruby
+require "philiprehberger/encoding_kit"
+Philiprehberger::EncodingKit.valid?("hello")                                # => true
+Philiprehberger::EncodingKit.valid?("\xFF\xFE".force_encoding("UTF-8"))     # => false
+Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII)  # => true
+```
+## API
+| Method | Description |
+|--------|-------------|
+| `EncodingKit.detect(string)` | Detect encoding via BOM and heuristics, returns an `Encoding` object |
+| `EncodingKit.to_utf8(string, from: nil)` | Convert to UTF-8, auto-detect source if `from` is nil |
+| `EncodingKit.normalize(string)` | Force to valid UTF-8, replacing bad bytes with U+FFFD |
+| `EncodingKit.valid?(string, encoding: nil)` | Check if string is valid in given or current encoding |
+| `EncodingKit.convert(string, from:, to:)` | Convert between arbitrary encodings |
+| `EncodingKit.strip_bom(string)` | Remove byte order mark if present |
+| `EncodingKit.bom?(string)` | Check if string starts with a BOM |
+## Development
+```bash
+bundle install
+bundle exec rspec
+bundle exec rubocop
+```
+## License
+[MIT](LICENSE)
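
The README's detection and normalization examples hinge on whether a byte sequence is well-formed UTF-8. The sketch below uses only core `String` methods, not the gem itself, to show why `caf\xC3\xA9` and `caf\xE9` are classified differently and what replacing a bad byte with U+FFFD looks like. The helper name `utf8_bytes?` is hypothetical.

```ruby
# Hypothetical helper: true when the raw bytes form well-formed UTF-8.
def utf8_bytes?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

two_byte_e_acute = "caf\xC3\xA9".b # U+00E9 encoded as UTF-8 (C3 A9)
latin1_e_acute   = "caf\xE9".b     # U+00E9 encoded as ISO-8859-1 (E9)

puts utf8_bytes?(two_byte_e_acute) # => true
puts utf8_bytes?(latin1_e_acute)   # => false

# Reinterpreting the Latin-1 bytes and transcoding yields the same text.
converted = latin1_e_acute.dup
                          .force_encoding(Encoding::ISO_8859_1)
                          .encode(Encoding::UTF_8)
puts converted == "caf\u00E9"      # => true

# Invalid bytes in a UTF-8 string can be replaced with U+FFFD via String#scrub.
dirty = "hello \xFF world".b.force_encoding(Encoding::UTF_8)
puts dirty.scrub == "hello \uFFFD world" # => true
```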

data/lib/philiprehberger/encoding_kit/converter.rb ADDED

@@ -0,0 +1,58 @@
+# frozen_string_literal: true
+module Philiprehberger
+  module EncodingKit
+    # Encoding conversion with fallback handling
+    module Converter
+      class << self
+        # Convert a string from one encoding to another.
+        #
+        # @param string [String] the input string
+        # @param from [String, Encoding] source encoding
+        # @param to [String, Encoding] target encoding
+        # @param fallback [Symbol] fallback strategy (:replace or :raise)
+        # @param replace [String] replacement string for invalid bytes and undefined characters
+        # @return [String] the converted string
+        # @raise [EncodingKit::Error] on conversion failure when fallback is :raise
+        def convert(string, from:, to:, fallback: :replace, replace: '?')
+          source = Encoding.find(from.to_s)
+          target = Encoding.find(to.to_s)
+          str = string.dup.force_encoding(source)
+          if fallback == :replace
+            str.encode(target, invalid: :replace, undef: :replace, replace: replace)
+          else
+            str.encode(target)
+          end
+        rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
+          raise Error, "Encoding conversion failed: #{e.message}"
+        end
+        # Convert a string to UTF-8, optionally auto-detecting the source encoding.
+        #
+        # @param string [String] the input string
+        # @param from [String, Encoding, nil] source encoding (auto-detect if nil)
+        # @return [String] UTF-8 encoded string
+        def to_utf8(string, from: nil)
+          source = from ? Encoding.find(from.to_s) : Detector.call(string)
+          str = string.dup.force_encoding(source)
+          str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
+        end
+        # Force a string to valid UTF-8 by replacing invalid and undefined bytes.
+        #
+        # @param string [String] the input string
+        # @return [String] valid UTF-8 string with replacement characters for bad bytes
+        def normalize(string)
+          str = string.dup
+          str.force_encoding(Encoding::UTF_8) if str.encoding == Encoding::BINARY # ASCII-8BIT is an alias of BINARY
+          return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?
+          str.encode(Encoding::UTF_8, str.encoding, invalid: :replace, undef: :replace, replace: "\uFFFD")
+        end
+      end
+    end
+  end
+end
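
The two fallback strategies in `convert` can be tried in isolation. The following is a standalone sketch of the same idea built directly on `String#encode`. The method name `transcode` and the class `ConversionError` are hypothetical stand-ins, not part of the gem.

```ruby
# Hypothetical stand-in for the gem's error class.
class ConversionError < StandardError; end

# Transcode `string` (interpreted as `from`) into `to`.
# With fallback :replace, unconvertible characters become `replace`;
# with any other value, the underlying exception is wrapped and re-raised.
def transcode(string, from:, to:, fallback: :replace, replace: "?")
  str = string.dup.force_encoding(from)
  if fallback == :replace
    str.encode(to, invalid: :replace, undef: :replace, replace: replace)
  else
    str.encode(to)
  end
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
  raise ConversionError, "Encoding conversion failed: #{e.message}"
end

euro = "price: \u20AC5" # the euro sign has no ISO-8859-1 mapping

puts transcode(euro, from: Encoding::UTF_8, to: Encoding::ISO_8859_1)
# => price: ?5

begin
  transcode(euro, from: Encoding::UTF_8, to: Encoding::ISO_8859_1, fallback: :raise)
rescue ConversionError => e
  puts e.message
end
```

With `:replace` the caller always gets a string back but loses information silently; the raising path is the safer choice when data loss must be noticed.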

data/lib/philiprehberger/encoding_kit/detector.rb ADDED

@@ -0,0 +1,81 @@
+# frozen_string_literal: true
+module Philiprehberger
+  module EncodingKit
+    # Encoding detection via BOM inspection and byte-pattern heuristics
+    module Detector
+      # BOM signatures ordered from longest to shortest to avoid false matches
+      BOMS = [
+        ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
+        ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
+        ["\xEF\xBB\xBF".b, Encoding::UTF_8],
+        ["\xFE\xFF".b,         Encoding::UTF_16BE],
+        ["\xFF\xFE".b,         Encoding::UTF_16LE]
+      ].freeze
+      class << self
+        # Detect the encoding of a byte string.
+        #
+        # Strategy:
+        #   1. Check for a byte order mark (BOM)
+        #   2. Try UTF-8 validity
+        #   3. Check pure ASCII
+        #   4. Apply Latin-1 heuristic
+        #   5. Fall back to BINARY
+        #
+        # @param string [String] the input string (ideally with BINARY/ASCII-8BIT encoding)
+        # @return [Encoding] the detected encoding
+        def call(string)
+          bytes = string.b
+          bom_encoding = detect_bom(bytes)
+          return bom_encoding if bom_encoding
+          return Encoding::UTF_8 if valid_utf8?(bytes)
+          return Encoding::US_ASCII if ascii_only?(bytes)
+          return Encoding::ISO_8859_1 if latin1_heuristic?(bytes)
+          Encoding::BINARY
+        end
+        # Check whether the string starts with a known BOM.
+        #
+        # @param bytes [String] binary string
+        # @return [Encoding, nil] the encoding indicated by the BOM, or nil
+        def detect_bom(bytes)
+          BOMS.each do |bom, encoding|
+            return encoding if bytes.start_with?(bom)
+          end
+          nil
+        end
+        private
+        # @param bytes [String] binary string
+        # @return [Boolean]
+        def valid_utf8?(bytes)
+          dup = bytes.dup.force_encoding(Encoding::UTF_8)
+          dup.valid_encoding? && !ascii_only?(bytes)
+        end
+        # @param bytes [String] binary string
+        # @return [Boolean]
+        def ascii_only?(bytes)
+          bytes.each_byte.all? { |b| b < 128 }
+        end
+        # Simple heuristic: if every byte is in the ISO-8859-1 printable range
+        # (0x20..0x7E or 0xA0..0xFF) or is a common control character, treat as Latin-1.
+        #
+        # @param bytes [String] binary string
+        # @return [Boolean]
+        def latin1_heuristic?(bytes)
+          bytes.each_byte.all? do |b|
+            (0x20..0x7E).cover?(b) || (0xA0..0xFF).cover?(b) ||
+              b == 0x09 || b == 0x0A || b == 0x0D # tab, LF, CR
+          end
+        end
+      end
+    end
+  end
+end
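
The longest-first ordering of the BOM table matters because the UTF-32LE mark (`FF FE 00 00`) begins with the UTF-16LE mark (`FF FE`). The sketch below copies the table into a hypothetical constant `BOM_TABLE` and uses a hypothetical helper `first_bom_match` to show what happens when the order is reversed.

```ruby
# BOM table in the same longest-first order as the constant in the diff.
BOM_TABLE = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b,     Encoding::UTF_8],
  ["\xFE\xFF".b,         Encoding::UTF_16BE],
  ["\xFF\xFE".b,         Encoding::UTF_16LE]
].freeze

# Return the encoding of the first BOM in `table` that prefixes `bytes`.
def first_bom_match(bytes, table)
  raw = bytes.b
  table.each { |bom, encoding| return encoding if raw.start_with?(bom) }
  nil
end

# "hi" in UTF-32LE, preceded by the UTF-32LE BOM.
utf32le = "\xFF\xFE\x00\x00h\x00\x00\x00i\x00\x00\x00".b

puts first_bom_match(utf32le, BOM_TABLE)         # => UTF-32LE
puts first_bom_match(utf32le, BOM_TABLE.reverse) # => UTF-16LE (shortest-first misfires)
```

The reverse ambiguity remains: UTF-16LE text whose first character after the BOM is U+0000 has the same leading four bytes as a UTF-32LE BOM, so a prefix check alone cannot tell the two apart.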

data/lib/philiprehberger/encoding_kit/version.rb ADDED

@@ -0,0 +1,7 @@
+# frozen_string_literal: true
+module Philiprehberger
+  module EncodingKit
+    VERSION = '0.1.0'
+  end
+end

data/lib/philiprehberger/encoding_kit.rb ADDED

@@ -0,0 +1,88 @@
+# frozen_string_literal: true
+require_relative 'encoding_kit/version'
+require_relative 'encoding_kit/detector'
+require_relative 'encoding_kit/converter'
+module Philiprehberger
+  module EncodingKit
+    class Error < StandardError; end
+    # BOM signatures (re-exported for public use)
+    BOMS = Detector::BOMS
+    # Detect the encoding of a string via BOM and heuristics.
+    #
+    # @param string [String] the input string
+    # @return [Encoding] the detected encoding
+    def self.detect(string)
+      Detector.call(string)
+    end
+    # Convert a string to UTF-8, auto-detecting source encoding if not specified.
+    #
+    # @param string [String] the input string
+    # @param from [String, Encoding, nil] source encoding (auto-detect if nil)
+    # @return [String] UTF-8 encoded string
+    def self.to_utf8(string, from: nil)
+      Converter.to_utf8(string, from: from)
+    end
+    # Normalize a string to valid UTF-8, replacing invalid/undefined bytes
+    # with the Unicode replacement character (U+FFFD).
+    #
+    # @param string [String] the input string
+    # @return [String] valid UTF-8 string
+    def self.normalize(string)
+      Converter.normalize(string)
+    end
+    # Check if a string is valid in the given encoding (or its current encoding).
+    #
+    # @param string [String] the input string
+    # @param encoding [String, Encoding, nil] encoding to check against (defaults to string's encoding)
+    # @return [Boolean]
+    def self.valid?(string, encoding: nil)
+      if encoding
+        enc = Encoding.find(encoding.to_s)
+        string.dup.force_encoding(enc).valid_encoding?
+      else
+        string.valid_encoding?
+      end
+    end
+    # Convert a string between encodings, replacing unconvertible characters with '?'.
+    #
+    # @param string [String] the input string
+    # @param from [String, Encoding] source encoding
+    # @param to [String, Encoding] target encoding
+    # @return [String] the converted string
+    def self.convert(string, from:, to:)
+      Converter.convert(string, from: from, to: to)
+    end
+    # Remove a byte order mark from the beginning of a string.
+    #
+    # @param string [String] the input string
+    # @return [String] the string without a BOM
+    def self.strip_bom(string)
+      bytes = string.b
+      BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
+        if bytes.start_with?(bom)
+          result = bytes[bom.bytesize..]
+          return result.force_encoding(string.encoding)
+        end
+      end
+      string.dup
+    end
+    # Check if a string starts with a byte order mark.
+    #
+    # @param string [String] the input string
+    # @return [Boolean]
+    def self.bom?(string)
+      bytes = string.b
+      BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
+    end
+  end
+end
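
BOM stripping works on raw bytes and then restores the caller's encoding label, so the result keeps the same `String#encoding` as the input. Below is a reduced sketch of that pattern for the UTF-8 mark only. The constant `UTF8_BOM` and the method `strip_utf8_bom` are hypothetical names used for illustration.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b

# Drop a leading UTF-8 BOM, keeping the caller's original encoding label.
def strip_utf8_bom(string)
  bytes = string.b
  return string.dup unless bytes.start_with?(UTF8_BOM)

  # Slice by byte offset, then relabel the remaining bytes.
  bytes.byteslice(UTF8_BOM.bytesize..).force_encoding(string.encoding)
end

with_bom = "\xEF\xBB\xBFhello"
stripped = strip_utf8_bom(with_bom)

puts stripped          # => hello
puts stripped.encoding # => UTF-8
puts stripped.bytesize # => 5
```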

metadata ADDED

@@ -0,0 +1,56 @@
+--- !ruby/object:Gem::Specification
+name: philiprehberger-encoding_kit
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Philip Rehberger
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2026-03-27 00:00:00.000000000 Z
+dependencies: []
+description: Detect encoding from BOM and heuristics, convert between encodings, normalize
+  to UTF-8, and strip byte order marks. Zero dependencies.
+email:
+- me@philiprehberger.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE
+- README.md
+- lib/philiprehberger/encoding_kit.rb
+- lib/philiprehberger/encoding_kit/converter.rb
+- lib/philiprehberger/encoding_kit/detector.rb
+- lib/philiprehberger/encoding_kit/version.rb
+homepage: https://github.com/philiprehberger/rb-encoding-kit
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://github.com/philiprehberger/rb-encoding-kit
+  source_code_uri: https://github.com/philiprehberger/rb-encoding-kit
+  changelog_uri: https://github.com/philiprehberger/rb-encoding-kit/blob/main/CHANGELOG.md
+  bug_tracker_uri: https://github.com/philiprehberger/rb-encoding-kit/issues
+  rubygems_mfa_required: 'true'
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 3.1.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.5.22
+signing_key:
+specification_version: 4
+summary: Character encoding detection, conversion, and normalization
+test_files: []