RubyGems: tre_regex 0.2.0-x86-linux-gnu → 0.2.2-x86-linux-gnu

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 2492401d2afd79d20d04b98b539787d649135547ea83b7f206a11b5c2890aafb
-  data.tar.gz: 5595f5ad86f74329a36bcc264c9382218b9862e8361731c43ede8e35f77e0232
+  metadata.gz: 49641fedd90a6ef3b9be258fc7542c8d07c3f5dd05751545be95f6a890327620
+  data.tar.gz: 3f7d5ccf25ba01f8647e7a177896f795ee584a62eec6ca51224cbe7e78b99871
 SHA512:
-  metadata.gz: 6b2c53c803a3c9eebe4e9fc6e6d16c1e2ef1eeed71e79f723c451546609d93504aca7c4b1c63e6358932986a4f8f29aca0a93c8fe0685a2bc4804d9e41e72dd3
-  data.tar.gz: f94de19f598616fe19d1ef8d5ae3429859f4bd0e691e3c7a295c09c9d584f2a39475751d31b33faf0f37f904d1f6c918bc89217e5167fa57d0117ba3e6c368de
+  metadata.gz: 0f5c5250ddcbae4bde28a0d51db3f5a432afa4a0c10d5bc8965bfda79b38ba75907a00e58ef6615b6f10de81b4b46b5851da0a4dca796a8c19c64d2cd7a9487b
+  data.tar.gz: 2f67d19b1aa19123f0b32750766cf38b897c3ea01d601cd5e4d2d23f095fe3581dc8806299e28d83a51ca077ae15c636a4ab71ee1d5bd9e1913fe4d4e709f19c
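
Note: the digests above belong to the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a minimal sketch, assuming those archives have already been extracted to local files, a digest can be recomputed with Ruby's standard `digest` library and compared with the recorded value. The helper name `checksum_matches?` and the temporary demo file are illustrative assumptions, not part of tre_regex or RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: recompute a file's digest and compare it with the
# hex string recorded in checksums.yaml. Pass Digest::SHA512 for the
# SHA512 entries.
def checksum_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Demo on a throwaway file so the sketch runs without a real gem archive.
Tempfile.create('checksum-demo') do |file|
  file.write('example payload')
  file.flush

  expected = Digest::SHA256.hexdigest('example payload')
  puts checksum_matches?(file.path, expected)        # matching digest
  puts checksum_matches?(file.path, 'deadbeef')      # mismatching digest
end
```

For a real check, `path` would point at the extracted `data.tar.gz` or `metadata.gz` and `expected_hex` would be the corresponding value from `checksums.yaml`.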

data/README.md CHANGED

@@ -1,6 +1,6 @@
 # TreRegex [![Ruby Checks](https://github.com/le0pard/tre_regex/actions/workflows/main.yml/badge.svg)](https://github.com/le0pard/tre_regex/actions/workflows/main.yml)
-`TreRegex` is a robust Ruby gem that provides a high-performance interface to the [TRE](https://github.com/laurikari/tre) approximate regex matching library. Powered by FFI, it allows you to perform lightning-fast fuzzy string searching while safely handling Ruby's Unicode characters.
+`TreRegex` provides a high-performance Ruby interface to the [TRE](https://github.com/laurikari/tre) C library using FFI. It brings robust approximate (fuzzy) regular expression matching to Ruby, featuring multi-byte Unicode string safety, and granular error limits
 ## Why?
@@ -87,10 +87,10 @@ regex = TreRegex::Regex.new('cat')
 # Returns an array of match hashes
 regex.match_all('cat, cot, cut', max_errors: 1).to_a
 # => [
-#   {match: "cat", submatches: [], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}},
+#  {match: "cat", submatches: [], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}},
 #  {match: "cot", submatches: [], index: 5, end_index: 8, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
 #  {match: "cut", submatches: [], index: 10, end_index: 13, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}}
-#    ]
+# ]
 ```
 ### Capture Groups (Submatches)
@@ -240,10 +240,10 @@ regex = TreRegex::Regex.new('cat')
 # but it also matches "" at the end of the string (3 deletions)!
 regex.match_all('cot, cow', max_errors: 3).to_a
 # => [
-#     {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
-#     {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}},
-#     {match: "", submatches: [], index: 8, end_index: 8, cost: 3, errors: {insertions: 0, deletions: 3, substitutions: 0}}
-#    ]
+#  {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
+#  {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}},
+#  {match: "", submatches: [], index: 8, end_index: 8, cost: 3, errors: {insertions: 0, deletions: 3, substitutions: 0}}
+# ]
 ```
 **Best Practice**: if you need a high `max_errors` limit but want to prevent the engine from matching empty strings, explicitly cap the `max_deletions` option so that at least one character of your pattern must survive
@@ -252,9 +252,9 @@ regex.match_all('cot, cow', max_errors: 3).to_a
 # Allow 3 total errors, but strictly forbid the engine from deleting more than 2 characters
 regex.match_all('cot, cow', max_errors: 3, max_deletions: 2).to_a
 # => [
-#     {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
-#     {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}}
-#    ] # The empty match is mathematically prevented
+#  {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
+#  {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}}
+# ] # The empty match is mathematically prevented
 ```
 ### POSIX vs. PCRE Syntax
@@ -316,6 +316,22 @@ If you need to find overlapping fuzzy matches, you will need to manually step th
 ## Development
+Because `TreRegex` compiles the underlying TRE C-library from source, you must have standard C-compilation and `autotools` dependencies installed on your machine before running the setup script
+**Ubuntu / Debian Linux**
+```bash
+sudo apt-get update
+sudo apt-get install build-essential autoconf automake libtool gettext autopoint pkg-config
+```
+**macOS**
+Then, install the autotools suite via [Homebrew](https://brew.sh/):
+```bash
+brew install autoconf automake libtool gettext pkg-config
+```
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).

data/ext/tre_regex/extconf.rb CHANGED

@@ -2,7 +2,6 @@
 require 'mkmf'
 require 'rbconfig'
-require 'open-uri'
 require 'net/http'
 require 'fileutils'
 require 'digest'

data/lib/tre_regex/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module TreRegex
-  VERSION = '0.2.0'
+  VERSION = '0.2.2'
 end

data/lib/tre_regex.rb CHANGED

@@ -42,6 +42,12 @@ module TreRegex
     REG_NEWLINE  = 4
     REG_NOSUB    = 8
+    # TRE's regex_t struct
+    class RegexT < FFI::Struct
+      layout :re_nsub, :size_t,
+             :value,   :pointer
+    end
     # Memory layout for TRE match offsets
     class RegMatch < FFI::Struct
       layout :rm_so, :int,
@@ -82,13 +88,12 @@ module TreRegex
     def initialize(pattern, ignore_case: false)
       @pattern = pattern
-      # Allocate a safe 256-byte buffer in C memory for the regex_t struc
-      @preg = FFI::MemoryPointer.new(:char, 256)
+      @preg = Native::RegexT.new
       flags = Native::REG_EXTENDED
       flags |= Native::REG_ICASE if ignore_case
-      res = Native.tre_regcomp(@preg, pattern, flags)
+      res = Native.tre_regcomp(@preg.to_ptr, pattern, flags)
       raise TreRegex::Error, "Failed to compile regex pattern: #{pattern}" if res != 0
       # Garbage Collection Hook: Tell Ruby to free the C memory when this object is destroyed
@@ -96,10 +101,12 @@ module TreRegex
     end
     # The GC finalizer proc
-    def self.finalize(preg_ptr)
+    def self.finalize(preg)
       proc do
-        Native.tre_regfree(preg_ptr)
-        preg_ptr.free
+        # Free the internal arrays allocated by TRE
+        Native.tre_regfree(preg.to_ptr)
+        # Safely free the struct memory ourselves
+        preg.to_ptr.free
       end
     end
@@ -149,7 +156,7 @@ module TreRegex
       pmatch_array = FFI::MemoryPointer.new(Native::RegMatch, MAX_NMATCH)
       match_data = prepare_match_data(pmatch_array, MAX_NMATCH)
-      res = Native.tre_reganexec(@preg, text_ptr, len, match_data, params, 0)
+      res = Native.tre_reganexec(@preg.to_ptr, text_ptr, len, match_data, params, 0)
       return nil unless res.zero?
       # Return the entire array pointer to be parsed
@@ -192,43 +199,71 @@ module TreRegex
       end
     end
+    # Helper to safely align TRE's raw byte offsets to valid UTF-8 boundaries
+    def align_bounds(text, absolute_so, absolute_eo)
+      safe_so = absolute_so.clamp(0, text.bytesize)
+      # Shift start backward to the nearest valid character boundary
+      # (byte & 0xC0) == 0x80 checks if the byte is a UTF-8 continuation byte
+      safe_so -= 1 while safe_so.positive? && (text.getbyte(safe_so) & 0xC0) == 0x80
+      safe_eo = absolute_eo.clamp(0, text.bytesize)
+      # Shift end forward to the nearest valid character boundary
+      safe_eo += 1 while safe_eo < text.bytesize && (text.getbyte(safe_eo) & 0xC0) == 0x80
+      safe_so = safe_eo if safe_so > safe_eo
+      [safe_so, safe_eo]
+    end
     def extract_match_payload(text, byte_off, char_off, m_info)
       pmatch_array, nmatch, match_data = m_info
+      abs_so, abs_eo = primary_match_bounds(text, byte_off, pmatch_array)
-      # Read the full match boundaries from index 0
-      full_rm = Native::RegMatch.new(pmatch_array)
-      rm_so = full_rm[:rm_so]
-      rm_eo = full_rm[:rm_eo]
+      match_str = text.byteslice(abs_so...abs_eo) || ''
+      start_index = char_off + (text.byteslice(byte_off...abs_so) || '').length
-      prefix_len = (text.byteslice(byte_off, rm_so) || '').length
-      match_str = text.byteslice((byte_off + rm_so)...(byte_off + rm_eo))
+      payload = format_payload(
+        match_str, start_index, match_data,
+        extract_submatches(text, byte_off, pmatch_array, nmatch)
+      )
-      payload = {
+      [payload, abs_eo - byte_off, start_index - char_off + match_str.length]
+    end
+    def primary_match_bounds(text, byte_off, pmatch_array)
+      full_rm = Native::RegMatch.new(pmatch_array)
+      align_bounds(text, byte_off + full_rm[:rm_so], byte_off + full_rm[:rm_eo])
+    end
+    def format_payload(match_str, start_index, match_data, submatches)
+      {
         match: match_str,
-        submatches: extract_submatches(text, byte_off, pmatch_array, nmatch),
-        index: char_off + prefix_len,
-        end_index: char_off + prefix_len + match_str.length,
+        submatches:,
+        index: start_index,
+        end_index: start_index + match_str.length,
         cost: match_data[:cost],
         errors: parse_errors(match_data)
       }
-      [payload, rm_eo, prefix_len + match_str.length]
     end
     def extract_submatches(text, byte_off, pmatch_array, nmatch)
       submatches = (1...nmatch).map do |i|
         # Advance the memory pointer by the size of the struct for each index
         rm = Native::RegMatch.new(pmatch_array + (i * Native::RegMatch.size))
-        sub_so = rm[:rm_so]
-        sub_eo = rm[:rm_eo]
-        # Safely extract the group, inserting nil if it was optional and unmatched
-        sub_so == -1 ? nil : text.byteslice((byte_off + sub_so)...(byte_off + sub_eo))
+        raw_so = rm[:rm_so]
+        raw_eo = rm[:rm_eo]
+        if raw_so == -1 || raw_so > raw_eo
+          nil
+        else
+          abs_so, abs_eo = align_bounds(text, byte_off + raw_so, byte_off + raw_eo)
+          text.byteslice(abs_so...abs_eo)
+        end
       end
-      # Cleanup: Remove trailing nil values (unused capture groups)
       submatches.pop while submatches.last.nil? && !submatches.empty?
       submatches
     end
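
Note: the `align_bounds` helper added in this diff widens a byte range until both ends sit on UTF-8 character boundaries, so that `byteslice` cannot cut a multi-byte character in half. The following is a standalone sketch of that idea using only core `String` methods. It mirrors the logic shown in the diff but lives outside the gem, and the sample string and offsets are made up for illustration.

```ruby
# Standalone illustration of the boundary alignment shown in the diff.
# A UTF-8 continuation byte always has the bit pattern 10xxxxxx,
# which is what (byte & 0xC0) == 0x80 tests for.
def align_bounds(text, start_byte, end_byte)
  safe_start = start_byte.clamp(0, text.bytesize)
  # Move the start backward while it points into the middle of a character.
  safe_start -= 1 while safe_start.positive? && (text.getbyte(safe_start) & 0xC0) == 0x80

  safe_end = end_byte.clamp(0, text.bytesize)
  # Move the end forward while it points into the middle of a character.
  safe_end += 1 while safe_end < text.bytesize && (text.getbyte(safe_end) & 0xC0) == 0x80

  # Never return an inverted range.
  safe_start = safe_end if safe_start > safe_end
  [safe_start, safe_end]
end

text = 'naïve café' # "ï" and "é" each occupy two bytes

# Offsets 3 and 11 both land on continuation bytes.
raw = text.byteslice(3...11)
puts raw.valid_encoding? # false: the slice starts and ends mid-character

start_byte, end_byte = align_bounds(text, 3, 11)
aligned = text.byteslice(start_byte...end_byte)
puts aligned.valid_encoding? # true
puts aligned                 # ïve café

# Character index of the aligned start, derived from the byte prefix,
# in the same way the diff computes start_index.
puts text.byteslice(0...start_byte).length # 2
```

Widening the range rather than shrinking it means the returned match can include a whole character that TRE only partially covered. That trade-off keeps every returned string valid UTF-8.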

data/tre_regex.gemspec CHANGED

@@ -9,11 +9,11 @@ Gem::Specification.new do |spec|
   spec.email = ['leopard.not.a@gmail.com']
   spec.license = 'MIT'
-  spec.summary = 'A fast Ruby FFI wrapper for the TRE approximate regex matching library.'
+  spec.summary = 'A fast Ruby FFI wrapper for the TRE approximate regex matching library'
   spec.description = [
-    'TreRegex provides a high-performance Ruby interface to the TRE C library using FFI.',
+    'TreRegex provides a high-performance Ruby interface to the TRE C library.',
     'It brings robust approximate (fuzzy) regular expression matching to Ruby, featuring',
-    'multi-byte Unicode string safety, granular error limits, and precompiled cross-platform native binaries'
+    'multi-byte Unicode string safety, and granular error limits'
   ].join(' ')
   spec.homepage = 'https://github.com/le0pard/tre_regex'
   spec.required_ruby_version = '>= 3.3.0'

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tre_regex
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.2.2
 platform: x86-linux-gnu
 authors:
 - Oleksii Vasyliev
@@ -23,10 +23,9 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '1.0'
-description: TreRegex provides a high-performance Ruby interface to the TRE C library
-  using FFI. It brings robust approximate (fuzzy) regular expression matching to Ruby,
-  featuring multi-byte Unicode string safety, granular error limits, and precompiled
-  cross-platform native binaries
+description: TreRegex provides a high-performance Ruby interface to the TRE C library.
+  It brings robust approximate (fuzzy) regular expression matching to Ruby, featuring
+  multi-byte Unicode string safety, and granular error limits
 email:
 - leopard.not.a@gmail.com
 executables: []
@@ -73,5 +72,5 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []
 rubygems_version: 4.0.6
 specification_version: 4
-summary: A fast Ruby FFI wrapper for the TRE approximate regex matching library.
+summary: A fast Ruby FFI wrapper for the TRE approximate regex matching library
 test_files: []