philiprehberger-encoding_kit 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 7b463d6499d157c675b38d8fa5c49a7c9057735e23bd3e253970d2f4af8caab7
4
+ data.tar.gz: e341c544cdce4088ae8ac9066f5ca5a48b552c66a61246ca5b6fa65e753e3bb6
5
+ SHA512:
6
+ metadata.gz: 7b3e7bbbd6b36fb4526e759d394df560d79364bfb19b71ee2b317c583e007f7385ee1e417a41188557899de2b52750d0366706a34e014662fc5a2536360aa849
7
+ data.tar.gz: a737e7967708482920174711f2339c5b14f608dbd87284e983608ddfe49dea9195b6fb21b570c388ebbe4f1c19e9e6409e3b34c5d3249a3a2b5661b30eca8e94
data/CHANGELOG.md ADDED
@@ -0,0 +1,18 @@
1
+ # Changelog
2
+
3
+ All notable changes to this gem will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+ ## [Unreleased]
9
+
10
+ ## [0.1.0] - 2026-03-26
11
+
12
+ ### Added
13
+ - Initial release
14
+ - Encoding detection via BOM inspection and byte-pattern heuristics
15
+ - Conversion between encodings with fallback options
16
+ - UTF-8 normalization with replacement character support
17
+ - BOM detection and stripping
18
+ - Encoding validity checks
data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 philiprehberger
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,118 @@
1
+ # philiprehberger-encoding_kit
2
+
3
+ [![Tests](https://github.com/philiprehberger/rb-encoding-kit/actions/workflows/ci.yml/badge.svg)](https://github.com/philiprehberger/rb-encoding-kit/actions/workflows/ci.yml)
4
+ [![Gem Version](https://badge.fury.io/rb/philiprehberger-encoding_kit.svg)](https://rubygems.org/gems/philiprehberger-encoding_kit)
5
+ [![License](https://img.shields.io/github/license/philiprehberger/rb-encoding-kit)](LICENSE)
6
+ [![Sponsor](https://img.shields.io/badge/sponsor-GitHub%20Sponsors-ec6cb9)](https://github.com/sponsors/philiprehberger)
7
+
8
+ Character encoding detection, conversion, and normalization
9
+
10
+ ## Requirements
11
+
12
+ - Ruby >= 3.1
13
+
14
+ ## Installation
15
+
16
+ Add to your Gemfile:
17
+
18
+ ```ruby
19
+ gem "philiprehberger-encoding_kit"
20
+ ```
21
+
22
+ Or install directly:
23
+
24
+ ```bash
25
+ gem install philiprehberger-encoding_kit
26
+ ```
27
+
28
+ ## Usage
29
+
30
+ ```ruby
31
+ require "philiprehberger/encoding_kit"
32
+
33
+ encoding = Philiprehberger::EncodingKit.detect(raw_bytes)
34
+ utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)
35
+ ```
36
+
37
+ ### Encoding Detection
38
+
39
+ ```ruby
40
+ require "philiprehberger/encoding_kit"
41
+
42
+ # Detects via BOM first, then UTF-8 validity, ASCII, Latin-1 heuristics
43
+ Philiprehberger::EncodingKit.detect("\xEF\xBB\xBFhello".b) # => Encoding::UTF_8
44
+ Philiprehberger::EncodingKit.detect("caf\xC3\xA9".b) # => Encoding::UTF_8
45
+ Philiprehberger::EncodingKit.detect("caf\xE9".b) # => Encoding::ISO_8859_1
46
+ ```
47
+
48
+ ### Convert to UTF-8
49
+
50
+ ```ruby
51
+ require "philiprehberger/encoding_kit"
52
+
53
+ # Auto-detect source encoding
54
+ utf8 = Philiprehberger::EncodingKit.to_utf8(raw_bytes)
55
+
56
+ # Specify source encoding
57
+ utf8 = Philiprehberger::EncodingKit.to_utf8(latin1_string, from: Encoding::ISO_8859_1)
58
+ ```
59
+
60
+ ### Normalize
61
+
62
+ ```ruby
63
+ require "philiprehberger/encoding_kit"
64
+
65
+ # Replace invalid/undefined bytes with U+FFFD
66
+ clean = Philiprehberger::EncodingKit.normalize("hello \xFF world".b)
67
+ ```
68
+
69
+ ### Convert Between Encodings
70
+
71
+ ```ruby
72
+ require "philiprehberger/encoding_kit"
73
+
74
+ latin1 = Philiprehberger::EncodingKit.convert(utf8_string, from: Encoding::UTF_8, to: Encoding::ISO_8859_1)
75
+ ```
76
+
77
+ ### BOM Handling
78
+
79
+ ```ruby
80
+ require "philiprehberger/encoding_kit"
81
+
82
+ Philiprehberger::EncodingKit.bom?("\xEF\xBB\xBFhello") # => true
83
+ Philiprehberger::EncodingKit.strip_bom("\xEF\xBB\xBFhello") # => "hello"
84
+ ```
85
+
86
+ ### Validity Check
87
+
88
+ ```ruby
89
+ require "philiprehberger/encoding_kit"
90
+
91
+ Philiprehberger::EncodingKit.valid?("hello") # => true
92
+ Philiprehberger::EncodingKit.valid?("\xFF\xFE".force_encoding("UTF-8")) # => false
93
+ Philiprehberger::EncodingKit.valid?("hello", encoding: Encoding::US_ASCII) # => true
94
+ ```
95
+
96
+ ## API
97
+
98
+ | Method | Description |
99
+ |--------|-------------|
100
+ | `EncodingKit.detect(string)` | Detect encoding via BOM and heuristics, returns an `Encoding` object |
101
+ | `EncodingKit.to_utf8(string, from: nil)` | Convert to UTF-8, auto-detect source if `from` is nil |
102
+ | `EncodingKit.normalize(string)` | Force to valid UTF-8, replacing bad bytes with U+FFFD |
103
+ | `EncodingKit.valid?(string, encoding: nil)` | Check if string is valid in given or current encoding |
104
+ | `EncodingKit.convert(string, from:, to:)` | Convert between arbitrary encodings |
105
+ | `EncodingKit.strip_bom(string)` | Remove byte order mark if present |
106
+ | `EncodingKit.bom?(string)` | Check if string starts with a BOM |
107
+
108
+ ## Development
109
+
110
+ ```bash
111
+ bundle install
112
+ bundle exec rspec
113
+ bundle exec rubocop
114
+ ```
115
+
116
+ ## License
117
+
118
+ [MIT](LICENSE)
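The README's Normalize example replaces bad bytes with U+FFFD. As a rough, stdlib-only illustration of that idea (not the gem's code), the sketch below uses Ruby's built-in `String#scrub` rather than the `String#encode` call the gem itself relies on. The helper name `scrub_to_utf8` is hypothetical.

```ruby
# Hypothetical helper, not part of the gem's API.
def scrub_to_utf8(bytes)
  # Reinterpret the bytes as UTF-8 without transcoding, then replace
  # each invalid byte sequence with U+FFFD.
  bytes.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

clean = scrub_to_utf8("hello \xFF world".b)
puts clean.valid_encoding?          # prints "true"
puts clean == "hello \uFFFD world"  # prints "true"
```

Already valid input passes through unchanged, so a helper like this can be applied more than once without altering the result.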
data/lib/philiprehberger/encoding_kit/converter.rb ADDED
@@ -0,0 +1,58 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Philiprehberger
4
+ module EncodingKit
5
+ # Encoding conversion with fallback handling
6
+ module Converter
7
+ class << self
8
+ # Convert a string from one encoding to another.
9
+ #
10
+ # @param string [String] the input string
11
+ # @param from [String, Encoding] source encoding
12
+ # @param to [String, Encoding] target encoding
13
+ # @param fallback [Symbol] fallback strategy (:replace or :raise)
14
+ # @param replace [String] replacement character for invalid bytes
15
+ # @return [String] the converted string
16
+ # @raise [EncodingKit::Error] on conversion failure when fallback is :raise
17
+ def convert(string, from:, to:, fallback: :replace, replace: '?')
18
+ source = Encoding.find(from.to_s)
19
+ target = Encoding.find(to.to_s)
20
+
21
+ str = string.dup.force_encoding(source)
22
+
23
+ if fallback == :replace
24
+ str.encode(target, invalid: :replace, undef: :replace, replace: replace)
25
+ else
26
+ str.encode(target)
27
+ end
28
+ rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
29
+ raise Error, "Encoding conversion failed: #{e.message}"
30
+ end
31
+
32
+ # Convert a string to UTF-8, optionally auto-detecting the source encoding.
33
+ #
34
+ # @param string [String] the input string
35
+ # @param from [String, Encoding, nil] source encoding (auto-detect if nil)
36
+ # @return [String] UTF-8 encoded string
37
+ def to_utf8(string, from: nil)
38
+ source = from ? Encoding.find(from.to_s) : Detector.call(string)
39
+ str = string.dup.force_encoding(source)
40
+ str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
41
+ end
42
+
43
+ # Force a string to valid UTF-8 by replacing invalid and undefined bytes.
44
+ #
45
+ # @param string [String] the input string
46
+ # @return [String] valid UTF-8 string with replacement characters for bad bytes
47
+ def normalize(string)
48
+ str = string.dup
49
+ str.force_encoding(Encoding::UTF_8) if [Encoding::BINARY, Encoding::ASCII_8BIT].include?(str.encoding)
50
+
51
+ return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?
52
+
53
+ str.encode(Encoding::UTF_8, str.encoding, invalid: :replace, undef: :replace, replace: "\uFFFD")
54
+ end
55
+ end
56
+ end
57
+ end
58
+ end
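The two fallback modes of `Converter.convert` can be tried without installing the gem. The sketch below mirrors the branch on `fallback` using only `String#encode`; the name `convert_sketch` is hypothetical, and unlike the gem it lets Ruby's own exception propagate instead of wrapping it in `EncodingKit::Error`.

```ruby
# Hypothetical stand-in for Converter.convert, built on String#encode only.
def convert_sketch(string, from:, to:, fallback: :replace, replace: '?')
  # Label the bytes with the source encoding; no bytes change here.
  str = string.dup.force_encoding(from)

  if fallback == :replace
    # Substitute the replacement string for invalid or unmappable characters.
    str.encode(to, invalid: :replace, undef: :replace, replace: replace)
  else
    # Let String#encode raise on the first problem it meets.
    str.encode(to)
  end
end

text = "caf\u00E9 \u20AC" # "café €"; the euro sign has no ISO-8859-1 mapping

lossy = convert_sketch(text, from: Encoding::UTF_8, to: Encoding::ISO_8859_1)
p lossy.bytes # => [99, 97, 102, 233, 32, 63]

begin
  convert_sketch(text, from: Encoding::UTF_8, to: Encoding::ISO_8859_1, fallback: :raise)
rescue Encoding::UndefinedConversionError => e
  puts "strict mode raised: #{e.class}"
end
```

The default `:replace` mode never fails but loses information silently (the euro sign becomes `?`), so the strict mode is probably the better fit wherever data loss must be noticed.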
data/lib/philiprehberger/encoding_kit/detector.rb ADDED
@@ -0,0 +1,81 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Philiprehberger
4
+ module EncodingKit
5
+ # Encoding detection via BOM inspection and byte-pattern heuristics
6
+ module Detector
7
+ # BOM signatures ordered from longest to shortest to avoid false matches
8
+ BOMS = [
9
+ ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
10
+ ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
11
+ ["\xEF\xBB\xBF".b, Encoding::UTF_8],
12
+ ["\xFE\xFF".b, Encoding::UTF_16BE],
13
+ ["\xFF\xFE".b, Encoding::UTF_16LE]
14
+ ].freeze
15
+
16
+ class << self
17
+ # Detect the encoding of a byte string.
18
+ #
19
+ # Strategy:
20
+ # 1. Check for a byte order mark (BOM)
21
+ # 2. Try UTF-8 validity
22
+ # 3. Check pure ASCII
23
+ # 4. Apply Latin-1 heuristic
24
+ # 5. Fall back to BINARY
25
+ #
26
+ # @param string [String] the input string (ideally with BINARY/ASCII-8BIT encoding)
27
+ # @return [Encoding] the detected encoding
28
+ def call(string)
29
+ bytes = string.b
30
+
31
+ bom_encoding = detect_bom(bytes)
32
+ return bom_encoding if bom_encoding
33
+
34
+ return Encoding::UTF_8 if valid_utf8?(bytes)
35
+ return Encoding::US_ASCII if ascii_only?(bytes)
36
+ return Encoding::ISO_8859_1 if latin1_heuristic?(bytes)
37
+
38
+ Encoding::BINARY
39
+ end
40
+
41
+ # Check whether the string starts with a known BOM.
42
+ #
43
+ # @param bytes [String] binary string
44
+ # @return [Encoding, nil] the encoding indicated by the BOM, or nil
45
+ def detect_bom(bytes)
46
+ BOMS.each do |bom, encoding|
47
+ return encoding if bytes.start_with?(bom)
48
+ end
49
+ nil
50
+ end
51
+
52
+ private
53
+
54
+ # @param bytes [String] binary string
55
+ # @return [Boolean]
56
+ def valid_utf8?(bytes)
57
+ dup = bytes.dup.force_encoding(Encoding::UTF_8)
58
+ dup.valid_encoding? && !ascii_only?(bytes)
59
+ end
60
+
61
+ # @param bytes [String] binary string
62
+ # @return [Boolean]
63
+ def ascii_only?(bytes)
64
+ bytes.each_byte.all? { |b| b < 128 }
65
+ end
66
+
67
+ # Simple heuristic: if every byte is in the ISO-8859-1 printable range
68
+ # (0x20..0x7E or 0xA0..0xFF) or is a common control character, treat as Latin-1.
69
+ #
70
+ # @param bytes [String] binary string
71
+ # @return [Boolean]
72
+ def latin1_heuristic?(bytes)
73
+ bytes.each_byte.all? do |b|
74
+ (0x20..0x7E).cover?(b) || (0xA0..0xFF).cover?(b) ||
75
+ b == 0x09 || b == 0x0A || b == 0x0D # tab, LF, CR
76
+ end
77
+ end
78
+ end
79
+ end
80
+ end
81
+ end
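The comment on `BOMS` says the signatures are ordered longest to shortest to avoid false matches. A small standalone sketch shows why: the UTF-16LE mark (`FF FE`) is a prefix of the UTF-32LE mark (`FF FE 00 00`), so checking the shorter one first would shadow the longer one. The names `BOM_TABLE` and `sniff_bom` are hypothetical; the table copies the signatures from the file above.

```ruby
# Same signatures as Detector::BOMS, longest first.
BOM_TABLE = [
  ["\x00\x00\xFE\xFF".b, Encoding::UTF_32BE],
  ["\xFF\xFE\x00\x00".b, Encoding::UTF_32LE],
  ["\xEF\xBB\xBF".b, Encoding::UTF_8],
  ["\xFE\xFF".b, Encoding::UTF_16BE],
  ["\xFF\xFE".b, Encoding::UTF_16LE]
].freeze

# Hypothetical helper: return the encoding of the first matching BOM, or nil.
def sniff_bom(string, table = BOM_TABLE)
  bytes = string.b
  table.each { |bom, encoding| return encoding if bytes.start_with?(bom) }
  nil
end

utf32le = "\xFF\xFE\x00\x00h\x00\x00\x00".b # UTF-32LE BOM followed by "h"

p sniff_bom(utf32le)                    # => #<Encoding:UTF-32LE>
p sniff_bom(utf32le, BOM_TABLE.reverse) # => #<Encoding:UTF-16LE>
p sniff_bom("hello")                    # => nil
```

Ordering does not remove the ambiguity entirely: the bytes `FF FE 00 00` are also a valid UTF-16LE BOM followed by a NUL character, and a table lookup alone cannot tell the two apart.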
data/lib/philiprehberger/encoding_kit/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Philiprehberger
4
+ module EncodingKit
5
+ VERSION = '0.1.0'
6
+ end
7
+ end
data/lib/philiprehberger/encoding_kit.rb ADDED
@@ -0,0 +1,88 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'encoding_kit/version'
4
+ require_relative 'encoding_kit/detector'
5
+ require_relative 'encoding_kit/converter'
6
+
7
+ module Philiprehberger
8
+ module EncodingKit
9
+ class Error < StandardError; end
10
+
11
+ # BOM signatures (re-exported for public use)
12
+ BOMS = Detector::BOMS
13
+
14
+ # Detect the encoding of a string via BOM and heuristics.
15
+ #
16
+ # @param string [String] the input string
17
+ # @return [Encoding] the detected encoding
18
+ def self.detect(string)
19
+ Detector.call(string)
20
+ end
21
+
22
+ # Convert a string to UTF-8, auto-detecting source encoding if not specified.
23
+ #
24
+ # @param string [String] the input string
25
+ # @param from [String, Encoding, nil] source encoding (auto-detect if nil)
26
+ # @return [String] UTF-8 encoded string
27
+ def self.to_utf8(string, from: nil)
28
+ Converter.to_utf8(string, from: from)
29
+ end
30
+
31
+ # Normalize a string to valid UTF-8, replacing invalid/undefined bytes
32
+ # with the Unicode replacement character (U+FFFD).
33
+ #
34
+ # @param string [String] the input string
35
+ # @return [String] valid UTF-8 string
36
+ def self.normalize(string)
37
+ Converter.normalize(string)
38
+ end
39
+
40
+ # Check if a string is valid in the given encoding (or its current encoding).
41
+ #
42
+ # @param string [String] the input string
43
+ # @param encoding [String, Encoding, nil] encoding to check against (defaults to string's encoding)
44
+ # @return [Boolean]
45
+ def self.valid?(string, encoding: nil)
46
+ if encoding
47
+ enc = Encoding.find(encoding.to_s)
48
+ string.dup.force_encoding(enc).valid_encoding?
49
+ else
50
+ string.valid_encoding?
51
+ end
52
+ end
53
+
54
+ # Convert a string between encodings.
55
+ #
56
+ # @param string [String] the input string
57
+ # @param from [String, Encoding] source encoding
58
+ # @param to [String, Encoding] target encoding
59
+ # @return [String] the converted string
60
+ def self.convert(string, from:, to:)
61
+ Converter.convert(string, from: from, to: to)
62
+ end
63
+
64
+ # Remove a byte order mark from the beginning of a string.
65
+ #
66
+ # @param string [String] the input string
67
+ # @return [String] the string without a BOM
68
+ def self.strip_bom(string)
69
+ bytes = string.b
70
+ BOMS.each do |bom, _encoding| # rubocop:disable Style/HashEachMethods
71
+ if bytes.start_with?(bom)
72
+ result = bytes[bom.bytesize..]
73
+ return result.force_encoding(string.encoding)
74
+ end
75
+ end
76
+ string.dup
77
+ end
78
+
79
+ # Check if a string starts with a byte order mark.
80
+ #
81
+ # @param string [String] the input string
82
+ # @return [Boolean]
83
+ def self.bom?(string)
84
+ bytes = string.b
85
+ BOMS.any? { |bom, _encoding| bytes.start_with?(bom) }
86
+ end
87
+ end
88
+ end
metadata ADDED
@@ -0,0 +1,56 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: philiprehberger-encoding_kit
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Philip Rehberger
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2026-03-27 00:00:00.000000000 Z
12
+ dependencies: []
13
+ description: Detect encoding from BOM and heuristics, convert between encodings, normalize
14
+ to UTF-8, and strip byte order marks. Zero dependencies.
15
+ email:
16
+ - me@philiprehberger.com
17
+ executables: []
18
+ extensions: []
19
+ extra_rdoc_files: []
20
+ files:
21
+ - CHANGELOG.md
22
+ - LICENSE
23
+ - README.md
24
+ - lib/philiprehberger/encoding_kit.rb
25
+ - lib/philiprehberger/encoding_kit/converter.rb
26
+ - lib/philiprehberger/encoding_kit/detector.rb
27
+ - lib/philiprehberger/encoding_kit/version.rb
28
+ homepage: https://github.com/philiprehberger/rb-encoding-kit
29
+ licenses:
30
+ - MIT
31
+ metadata:
32
+ homepage_uri: https://github.com/philiprehberger/rb-encoding-kit
33
+ source_code_uri: https://github.com/philiprehberger/rb-encoding-kit
34
+ changelog_uri: https://github.com/philiprehberger/rb-encoding-kit/blob/main/CHANGELOG.md
35
+ bug_tracker_uri: https://github.com/philiprehberger/rb-encoding-kit/issues
36
+ rubygems_mfa_required: 'true'
37
+ post_install_message:
38
+ rdoc_options: []
39
+ require_paths:
40
+ - lib
41
+ required_ruby_version: !ruby/object:Gem::Requirement
42
+ requirements:
43
+ - - ">="
44
+ - !ruby/object:Gem::Version
45
+ version: 3.1.0
46
+ required_rubygems_version: !ruby/object:Gem::Requirement
47
+ requirements:
48
+ - - ">="
49
+ - !ruby/object:Gem::Version
50
+ version: '0'
51
+ requirements: []
52
+ rubygems_version: 3.5.22
53
+ signing_key:
54
+ specification_version: 4
55
+ summary: Character encoding detection, conversion, and normalization
56
+ test_files: []