RubyGems - unicode-display_width - Versions diffs - 3.1.1 → 3.1.3 - Mend

unicode-display_width 3.1.1 → 3.1.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -1
data/README.md +6 -1
data/data/display_width.marshal.gz +0 -0
data/lib/unicode/display_width/constants.rb +1 -1
data/lib/unicode/display_width.rb +92 -106
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1886b39340f01645fd4e64d032ab1611e471e60c4dbd46e3d8867125ef45232d
-  data.tar.gz: dc452df48efa0f7f9cd862b0390f611df83e4dcd82c640f4faa151b05022596f
+  metadata.gz: 4b0b5fe12467a22c6b21ad6dfb8dc422eb547e252690e432afc7504e8dae641c
+  data.tar.gz: a2c0e4c856034b1ef64946861d33845615cd8c950da462441faea2f900c14502
 SHA512:
-  metadata.gz: 85dfef303836ba1c13271144ad24f89dbc40591d9056ee187c9ee5b7b6ff1f19d8d9ebd4e21108f78a61d2e3d3c6ce44f560005e3216cf3f8fa595466c50dfc7
-  data.tar.gz: eae08ff81ed83a3965820aaf09ce10fd028428adf6653807d3fb87b0c96b9ae7be757d38ffa90a8a27b39a8454e4b2694109a8aaddcc75d21d0e449ce8f7f628
+  metadata.gz: 8a9499ffcdc0f6def0ac88fc13aaaaea0e46031a63e05c00c32872e1f066d7550abd8e0cb3efaea084a07ebc168e87ed2a3a32effa040c1a651c3288157704f1
+  data.tar.gz: f6a6c7e002476db323d52073ef00567c6eafb564864f1b9c6b72ba7228c5f1a0325e12a07445e0d7a7af66ff988ccd05e89d268898c78c2eaaf97735f79b3f90

data/CHANGELOG.md CHANGED

@@ -1,5 +1,16 @@
 # CHANGELOG
+## 3.1.3
+Better handling of non-UTF-8 strings, patch by @Earlopain:
+- Data with *BINARY* encoding is interpreted as UTF-8, if possible
+- Use `invalid: :replace` and `undef: :replace` options when converting to UTF-8
+## 3.1.2
+- Performance improvements
 ## 3.1.1
 - Performance improvements
@@ -11,7 +22,7 @@
 - Emoji modes: Differentiate between well-formed Emoji (`:possible`) and any
   ZWJ/modifier sequence (`:all`). The latter is more common and more efficient
   to implement.
-- Unify `rgi_{fqe,mqe,uqe}` options to just `:rgi` to keep things simpler (corresponds to
+- Unify `:rgi_{fqe,mqe,uqe}` options to just `:rgi` to keep things simpler (corresponds to
   the former `:rgi_uqe` option). Most terminals that want to support the RGI set
   will probably want to catch Emoji sequences with missing VS16s.
 - Add new `:all_no_vs16` and `:rgi_at` modes to be able to support some terminals
@@ -24,6 +35,7 @@
 ## 3.0.1
 - Add WezTerm and foot as good Emoji terminals
 ## 3.0.0

data/README.md CHANGED

@@ -71,6 +71,11 @@ Unicode::DisplayWidth.of("·", 1) # => 1
 Unicode::DisplayWidth.of("·", 2) # => 2
 ```
+### Encoding Notes
+- Data with *BINARY* encoding is interpreted as UTF-8, if possible
+- Non-UTF-8 strings are converted to UTF-8 before measuring, using the [`{invalid: :replace, undef: :replace}`) options](https://ruby-doc.org/3.3.5/encodings_rdoc.html#label-Encoding+Options)
 ### Custom Overwrites
 You can overwrite how to handle specific code points by passing a hash (or even a proc) as `overwrite:` parameter:
@@ -126,7 +131,7 @@ The `emoji:` option can be used to configure which type of Emoji should be consi
 Unfortunately, the level of Emoji support varies a lot between terminals. While some of them are able to display (almost) all Emoji sequences correctly, others fall back to displaying sequences of basic Emoji. When `emoji: true` or `emoji: :auto` is used, the gem will attempt to set the best fitting Emoji setting for you (e.g. `:rgi_at` on "Apple_Terminal" or `false` on Gnome's terminal widget).
-Please note that Emoji display and number of terminal columns used might differs a lot. For example, it might be the case that a terminal does not understand which Emoji to display, but still manages to calculate the proper amount of terminal cells. The automatic Emoji support level per terminal only considers the latter (cursor position), not the actual Emoji image(s) displayed. Please [open an issue](https://github.com/janlelis/unicode-display_width/issues/new) if you notice your terminal application could use a better default value. Also see the [ucs-detect project](https://ucs-detect.readthedocs.io/results.html), which is a great resource that compares various terminal's Unicode/Emoji capabilities.
+Please note that Emoji display and number of terminal columns used might differs a lot. For example, it might be the case that a terminal does not understand which Emoji to display, but still manages to calculate the proper amount of terminal cells. The automatic Emoji support level per terminal only considers the latter (cursor position), not the actual Emoji image(s) displayed. Please [open an issue](https://github.com/janlelis/unicode-display_width/issues/new) if you notice your terminal application could use a better default value. Also see the [ucs-detect project](https://ucs-detect.readthedocs.io/results.html), which is a great resource that compares various terminal's Unicode/Emoji capabilities. You can checkout how your terminals renders different kind of Emoji types with this [terminal-emoji-width.rb script](https://github.com/janlelis/unicode-display_width/blob/main/misc/terminal-emoji-width.rb).
 **To terminal implementors reading this:** Although the practice of giving all Emoji/ZWJ sequences a width of 2 (`:all` mode described above) has some advantages, it does not lead to a particularly good developer experience. Since there is always the possibility of well-formed Emoji that are currently not supported (non-RGI / future Unicode) appearing, those sequences will take more cells. Instead of overflowing, cutting off sequences or displaying placeholder-Emoji, could it be worthwile to implement the `:rgi` option (only known Emoji get width 2) and give those unknown Emoji the space they need? This would support the idea that the meaning of an unknown Emoji sequence can still be conveyed (without messing up the terminal at the same time). Just a thought…
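
The encoding behaviour described in the new "Encoding Notes" README section and in the 3.1.3 changelog entry can be tried out without the gem. The following is a minimal sketch that uses only core String methods. `to_measurable_utf8` is a hypothetical helper name (it is not part of the gem), and the `dup` is an addition of this sketch so that the caller's string is left untouched.

```ruby
# Hypothetical helper (not part of unicode-display_width): normalizes a string
# to UTF-8 following the steps visible in the 3.1.3 diff of display_width.rb.
def to_measurable_utf8(string)
  # Assumption of this sketch: work on a copy, because force_encoding mutates
  # its receiver.
  string = string.dup

  # Step 1: BINARY data is tentatively reinterpreted as UTF-8.
  if string.encoding == Encoding::BINARY &&
     !string.force_encoding(Encoding::UTF_8).valid_encoding?
    # The bytes are not valid UTF-8, so go back to BINARY.
    string.force_encoding(Encoding::BINARY)
  end

  # Step 2: anything that is still not UTF-8 is transcoded; bytes that cannot
  # be converted are replaced instead of raising an exception.
  return string if string.encoding == Encoding::UTF_8
  string.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

valid_bytes   = "\xC3\xA9".b                          # UTF-8 bytes of "é", tagged BINARY
invalid_bytes = "\xFF".b                              # not valid UTF-8
latin1        = "\u00E9".encode(Encoding::ISO_8859_1) # "é" in ISO-8859-1

p to_measurable_utf8(valid_bytes)   # => "é"
p to_measurable_utf8(invalid_bytes) # => "\uFFFD" (the Unicode replacement character)
p to_measurable_utf8(latin1)        # => "é"
```

With a Unicode target encoding, `String#encode` substitutes U+FFFD for bytes it cannot convert, which is why the invalid binary input comes back as a single replacement character rather than an error.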

data/data/display_width.marshal.gz CHANGED

Binary file

data/lib/unicode/display_width/constants.rb CHANGED

@@ -2,7 +2,7 @@
 module Unicode
   class DisplayWidth
-    VERSION = "3.1.1"
+    VERSION = "3.1.3"
     UNICODE_VERSION = "16.0.0"
     DATA_DIRECTORY = File.expand_path(File.dirname(__FILE__) + "/../../../data/")
     INDEX_FILENAME = DATA_DIRECTORY + "/display_width.marshal.gz"

data/lib/unicode/display_width.rb CHANGED

@@ -10,8 +10,8 @@ module Unicode
   class DisplayWidth
     DEFAULT_AMBIGUOUS = 1
     INITIAL_DEPTH = 0x10000
-    ASCII_NON_ZERO_REGEX = /[\0\x05\a\b\n\v\f\r\x0E\x0F]/
-    ASCII_NON_ZERO_STRING = "\0\x05\a\b\n\v\f\r\x0E\x0F"
+    ASCII_NON_ZERO_REGEX = /[\0\x05\a\b\n-\x0F]/
+    ASCII_NON_ZERO_STRING = "\0\x05\a\b\n-\x0F"
     ASCII_BACKSPACE = "\b"
     AMBIGUOUS_MAP = {
       1 => :WIDTH_ONE,
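
The hunk above replaces the explicit list `\n\v\f\r\x0E\x0F` with the range `\n-\x0F` in both the regex and the string handed to `String#delete`. A short sketch, assuming nothing beyond core Ruby, can check that both notations select the same ASCII characters and show how the constants feed the ASCII width calculation. The names `OLD_SET`, `NEW_SET`, `same_ascii_sets?` and `ascii_width_sketch` are hypothetical and exist only for this illustration.

```ruby
# Character sets copied from the diff: explicit list (3.1.1) and range (3.1.3).
OLD_SET   = "\0\x05\a\b\n\v\f\r\x0E\x0F"
NEW_SET   = "\0\x05\a\b\n-\x0F"
OLD_REGEX = /[\0\x05\a\b\n\v\f\r\x0E\x0F]/
NEW_REGEX = /[\0\x05\a\b\n-\x0F]/

# Every ASCII character as a one-character string.
ALL_ASCII = (0..127).map { |byte| byte.chr }

# True if old and new notation agree for every ASCII character, both as a
# regex character class and as a String#count / String#delete selector.
def same_ascii_sets?
  ALL_ASCII.all? do |char|
    char.match?(OLD_REGEX) == char.match?(NEW_REGEX) &&
      char.count(OLD_SET) == char.count(NEW_SET)
  end
end

# Sketch of the ASCII-only width rule from the diff: the listed control
# characters take no column, and each backspace additionally removes one.
def ascii_width_sketch(string)
  return string.bytesize unless string.match?(NEW_REGEX)

  width = string.delete(NEW_SET).bytesize - string.count("\b")
  width < 0 ? 0 : width
end

p same_ascii_sets?               # => true
p ascii_width_sketch("abc")      # => 3
p ascii_width_sketch("a\nb")     # => 2
p ascii_width_sketch("ab\bc")    # => 2
p ascii_width_sketch("\b\b")     # => 0
```

The range works in both places because `\n` through `\x0F` are the contiguous bytes 0x0A to 0x0F, and both regex character classes and `String#delete` selectors interpret `a-b` as a range.
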
@@ -21,6 +21,10 @@ module Unicode
       WIDTH_ONE: 768,
       WIDTH_TWO: 161,
     }
+    NOT_COMMON_NARROW_REGEX = {
+     WIDTH_ONE: /[^\u{10}-\u{2FF}]/m,
+     WIDTH_TWO: /[^\u{10}-\u{A1}]/m,
+    }
     FIRST_4096 = {
       WIDTH_ONE: decompress_index(INDEX[:WIDTH_ONE][0][0], 1),
       WIDTH_TWO: decompress_index(INDEX[:WIDTH_TWO][0][0], 1),
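
The `NOT_COMMON_NARROW_REGEX` constant added above backs a new early return in `.of`: when a string contains no character outside the pattern's range, its width is simply its character count and the per-codepoint index lookup is skipped. A minimal sketch of that idea, covering only the `WIDTH_ONE` pattern and assuming UTF-8 input, might look like this. `common_narrow_width` is a hypothetical name, and returning `nil` to signal "fall through to the slow path" is a choice made for this sketch only.

```ruby
# Pattern copied from the diff (WIDTH_ONE variant): matches any character
# outside U+0010..U+02FF.
NOT_COMMON_NARROW_WIDTH_ONE = /[^\u{10}-\u{2FF}]/m

# Hypothetical helper: returns the width if the fast path applies, else nil.
# Assumes the string is already UTF-8 encoded.
def common_narrow_width(string)
  return nil if string.match?(NOT_COMMON_NARROW_WIDTH_ONE)

  # Every character is in U+0010..U+02FF, so each one counts as one column.
  string.size
end

p common_narrow_width("h\u00E9llo")   # => 5 ("é" is U+00E9, inside the range)
p common_narrow_width("\u65E5\u672C") # => nil (CJK characters need the index lookup)
p common_narrow_width("a\nb")         # => nil ("\n" is below U+0010)
```

The upper bound U+02FF lines up with `FIRST_AMBIGUOUS[:WIDTH_ONE]` (768, that is U+0300), the same threshold the per-codepoint loop uses for its "width 1" branch.
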
@@ -30,7 +34,6 @@ module Unicode
       rgi_at: :REGEX_INCLUDE_MQE_UQE,
       possible: :REGEX_WELL_FORMED,
     }
-    REGEX_EMOJI_NOT_POSSIBLE = /\A[#*0-9]\z/
     REGEX_EMOJI_VS16 = Regexp.union(
       Regexp.compile(
         Unicode::Emoji::REGEX_TEXT_PRESENTATION.source +
@@ -44,120 +47,55 @@ module Unicode
     # Returns monospace display width of string
     def self.of(string, ambiguous = nil, overwrite = nil, old_options = {}, **options)
-      unless old_options.empty?
-        warn "Unicode::DisplayWidth: Please migrate to keyword arguments - #{old_options.inspect}"
-        options.merge! old_options
+      # Binary strings don't make much sense when calculating display width.
+      # Assume it's valid UTF-8
+      if string.encoding == Encoding::BINARY && !string.force_encoding(Encoding::UTF_8).valid_encoding?
+        # Didn't work out, go back to binary
+        string.force_encoding(Encoding::BINARY)
       end
-      options[:ambiguous] = ambiguous if ambiguous
-      options[:ambiguous] ||= DEFAULT_AMBIGUOUS
+      string = string.encode(Encoding::UTF_8, invalid: :replace, undef: :replace) unless string.encoding == Encoding::UTF_8
+      options = normalize_options(string, ambiguous, overwrite, old_options, **options)
-      if options[:ambiguous] != 1 && options[:ambiguous] != 2
-        raise ArgumentError, "Unicode::DisplayWidth: Ambiguous width must be 1 or 2"
-      end
+      width = 0
-      if overwrite && !overwrite.empty?
-        warn "Unicode::DisplayWidth: Please migrate to keyword arguments - overwrite: #{overwrite.inspect}"
-        options[:overwrite] = overwrite
+      unless options[:overwrite].empty?
+        width, string = width_custom(string, options[:overwrite])
       end
-      options[:overwrite] ||= {}
-      if [nil, true, :auto].include?(options[:emoji])
-        options[:emoji] = EmojiSupport.recommended
+      if string.ascii_only?
+        return width + width_ascii(string)
       end
-      # # #
-      if !options[:overwrite].empty?
-        return width_frame(string, options) do |string, index_full, index_low, first_ambiguous|
-          width_all_features(string, index_full, index_low, first_ambiguous, options[:overwrite])
-        end
-      end
-      if !string.ascii_only?
-        return width_frame(string, options) do |string, index_full, index_low, first_ambiguous|
-          width_no_overwrite(string, index_full, index_low, first_ambiguous)
-        end
-      end
-      width_ascii(string)
-    end
+      ambiguous_index_name = AMBIGUOUS_MAP[options[:ambiguous]]
-    def self.width_ascii(string)
-      # Optimization for ASCII-only strings without certain control symbols
-      if string.match?(ASCII_NON_ZERO_REGEX)
-        res = string.delete(ASCII_NON_ZERO_STRING).size - string.count(ASCII_BACKSPACE)
-        return res < 0 ? 0 : res
+      unless string.match?(NOT_COMMON_NARROW_REGEX[ambiguous_index_name])
+        return width + string.size
       end
-      # Pure ASCII
-      string.size
-    end
-    def self.width_frame(string, options)
       # Retrieve Emoji width
-      if options[:emoji] == false || options[:emoji] == :none
-        res = 0
-      else
-        res, string = emoji_width(
+      if options[:emoji] != :none
+        e_width, string = emoji_width(
           string,
           options[:emoji],
           options[:ambiguous],
         )
-      end
-      # Prepare indexes
-      ambiguous_index_name = AMBIGUOUS_MAP[options[:ambiguous]]
-      # Get general width
-      res += yield(string, INDEX[ambiguous_index_name], FIRST_4096[ambiguous_index_name], FIRST_AMBIGUOUS[ambiguous_index_name])
+        width += e_width
-      # Return result + prevent negative lengths
-      res < 0 ? 0 : res
-    end
-    def self.width_no_overwrite(string, index_full, index_low, first_ambiguous, _ = {})
-      res = 0
-      # Make sure we have UTF-8
-      string = string.encode(Encoding::UTF_8) unless string.encoding.name == "utf-8"
-      string.scan(/.{,80}/m){ |batch|
-        if batch.ascii_only?
-          res += batch.size
-        else
-          batch.each_codepoint{ |codepoint|
-            if codepoint > 15 && codepoint < first_ambiguous
-              res += 1
-            elsif codepoint < 0x1001
-              res += index_low[codepoint] || 1
-            else
-              d = INITIAL_DEPTH
-              w = index_full[codepoint / d]
-              while w.instance_of? Array
-                w = w[(codepoint %= d) / (d /= 16)]
-              end
-              res += w || 1
-            end
-          }
+        unless string.match?(NOT_COMMON_NARROW_REGEX[ambiguous_index_name])
+          return width + string.size
         end
-      }
+      end
-      res
-    end
-    # Same as .width_no_overwrite - but with applying overwrites for each char
-    def self.width_all_features(string, index_full, index_low, first_ambiguous, overwrite)
-      res = 0
+      index_full = INDEX[ambiguous_index_name]
+      index_low = FIRST_4096[ambiguous_index_name]
+      first_ambiguous = FIRST_AMBIGUOUS[ambiguous_index_name]
       string.each_codepoint{ |codepoint|
-        if overwrite[codepoint]
-          res += overwrite[codepoint]
-        elsif codepoint > 15 && codepoint < first_ambiguous
-          res += 1
+        if codepoint > 15 && codepoint < first_ambiguous
+          width += 1
         elsif codepoint < 0x1001
-          res += index_low[codepoint] || 1
+          width += index_low[codepoint] || 1
         else
           d = INITIAL_DEPTH
           w = index_full[codepoint / d]
@@ -165,19 +103,44 @@ module Unicode
             w = w[(codepoint %= d) / (d /= 16)]
           end
-          res += w || 1
+          width += w || 1
         end
       }
-      res
+      # Return result + prevent negative lengths
+      width < 0 ? 0 : width
     end
+    # Returns width of custom overwrites and remaining string
+    def self.width_custom(string, overwrite)
+      width = 0
+      string = string.each_codepoint.select{ |codepoint|
+        if overwrite[codepoint]
+          width += overwrite[codepoint]
+          nil
+        else
+          codepoint
+        end
+      }.pack("U*")
+      [width, string]
+    end
+    # Returns width for ASCII-only strings. Will consider zero-width control symbols.
+    def self.width_ascii(string)
+      if string.match?(ASCII_NON_ZERO_REGEX)
+        res = string.delete(ASCII_NON_ZERO_STRING).bytesize - string.count(ASCII_BACKSPACE)
+        return res < 0 ? 0 : res
+      end
+      string.bytesize
+    end
+    # Returns width of all considered Emoji and remaining string
     def self.emoji_width(string, mode = :all, ambiguous = DEFAULT_AMBIGUOUS)
       res = 0
-      string = string.encode(Encoding::UTF_8) unless string.encoding.name == "utf-8"
       if emoji_set_regex = EMOJI_SEQUENCES_REGEX_MAPPING[mode]
         emoji_width_via_possible(
           string,
@@ -209,13 +172,9 @@ module Unicode
       res = 0
       # For each string possibly an emoji
-      no_emoji_string = string.gsub(Unicode::Emoji::REGEX_POSSIBLE){ |emoji_candidate|
-        # Skip notorious false positives
-        if REGEX_EMOJI_NOT_POSSIBLE.match?(emoji_candidate)
-          emoji_candidate
+      no_emoji_string = string.gsub(REGEX_EMOJI_ALL_SEQUENCES_AND_VS16){ |emoji_candidate|
         # Check if we have a combined Emoji with width 2 (or EAW an Apple Terminal)
-        elsif emoji_candidate == emoji_candidate[emoji_set_regex]
+        if emoji_candidate == emoji_candidate[emoji_set_regex]
           if strict_eaw
             res += self.of(emoji_candidate[0], ambiguous, emoji: false)
           else
@@ -237,6 +196,34 @@ module Unicode
       [res, no_emoji_string]
     end
+    def self.normalize_options(string, ambiguous = nil, overwrite = nil, old_options = {}, **options)
+      unless old_options.empty?
+        warn "Unicode::DisplayWidth: Please migrate to keyword arguments - #{old_options.inspect}"
+        options.merge! old_options
+      end
+      options[:ambiguous] = ambiguous if ambiguous
+      options[:ambiguous] ||= DEFAULT_AMBIGUOUS
+      if options[:ambiguous] != 1 && options[:ambiguous] != 2
+        raise ArgumentError, "Unicode::DisplayWidth: Ambiguous width must be 1 or 2"
+      end
+      if overwrite && !overwrite.empty?
+        warn "Unicode::DisplayWidth: Please migrate to keyword arguments - overwrite: #{overwrite.inspect}"
+        options[:overwrite] = overwrite
+      end
+      options[:overwrite] ||= {}
+      if [nil, true, :auto].include?(options[:emoji])
+        options[:emoji] = EmojiSupport.recommended
+      elsif options[:emoji] == false
+        options[:emoji] = :none
+      end
+      options
+    end
     def initialize(ambiguous: DEFAULT_AMBIGUOUS, overwrite: {}, emoji: true)
       @ambiguous = ambiguous
       @overwrite = overwrite
@@ -256,4 +243,3 @@ module Unicode
     end
   end
 end
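
The new `width_custom` method in this file separates overwrite handling from the main loop: it sums the widths of all overwritten code points and hands back the string without them, so the remaining steps never see those characters. The sketch below reproduces that technique in isolation. `split_overwrites` is a hypothetical name, and the overwrite hash in the usage lines is made-up example data.

```ruby
# Hypothetical stand-alone version of the technique used by width_custom:
# returns [summed width of overwritten code points, string without them].
def split_overwrites(string, overwrite)
  width = 0

  kept = string.each_codepoint.select do |codepoint|
    if overwrite[codepoint]
      width += overwrite[codepoint]
      nil        # falsy: the code point is dropped from the result
    else
      codepoint  # truthy (even 0 is truthy in Ruby): the code point is kept
    end
  end

  # Array#pack("U*") turns the remaining code points back into a UTF-8 string.
  [width, kept.pack("U*")]
end

# Example data (an assumption for illustration): treat U+00B7 as 2 columns wide.
overwrite = { 0x00B7 => 2 }

p split_overwrites("a\u00B7b", overwrite) # => [2, "ab"]
p split_overwrites("abc", overwrite)      # => [0, "abc"]
```

Because the overwritten characters are removed up front, a string such as `"a·b"` with that overwrite becomes ASCII-only afterwards, which lets it take the cheap ASCII branch shown in the diff.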

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: unicode-display_width
 version: !ruby/object:Gem::Version
-  version: 3.1.1
+  version: 3.1.3
 platform: ruby
 authors:
 - Jan Lelis
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-11-19 00:00:00.000000000 Z
+date: 2024-12-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: unicode-emoji
@@ -104,7 +104,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.21
+rubygems_version: 3.1.6
 signing_key:
 specification_version: 4
 summary: Determines the monospace display width of a string in Ruby.