RubyGems - unicode-emoji - Versions diffs - 3.7.0 → 3.8.0 - Mend

unicode-emoji 3.7.0 → 3.8.0

Files changed (35)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/CHANGELOG.md +11 -1
data/README.md +98 -55
data/Rakefile +6 -2
data/data/emoji.marshal.gz +0 -0
data/data/generate_constants.rb +97 -40
data/lib/unicode/emoji/constants.rb +17 -1
data/lib/unicode/emoji/generated/regex.rb +1 -1
data/lib/unicode/emoji/generated/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_basic.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/lazy_constants.rb +36 -0
data/lib/unicode/emoji/list.rb +3 -0
data/lib/unicode/emoji.rb +33 -6
data/spec/data/.keep +0 -0
data/spec/data/emoji-test.txt +5331 -0
data/spec/emoji_test_txt_spec.rb +181 -0
data/spec/unicode_emoji_spec.rb +36 -4
metadata +12 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1e5b069f3f3d2de97f31a0aa83568454b751330b8e86392d28baa7906a6d6302
-  data.tar.gz: afaed5c343b5cdb5f235bf1e40fd78d221b0efa3097517f015db3c5a58e96775
+  metadata.gz: e3c7cc2671d256d8208b72d719384e7c13aaace4fec6b4919b92640e5336d87f
+  data.tar.gz: 9420777da4805787467c7f4eac1580f7179a6abf56e54b011c11640a50502b88
 SHA512:
-  metadata.gz: e9c6726eee2f48cd0c51937bb0fcd6f0009b7fe0eb9e973bef713ac063d668c4d520f490a46bdf944892dcac2854b17ca72779593257f3edc8f0dfd445f3a729
-  data.tar.gz: 7c1747073a4ec43ea4ea24ff304eb2cf3deb58918f2757cd4b4daee922b15d5478bd979585b60984366d21787b3d5183bcd5fc8cd3cdea6aa5ceedb9860e19d6
+  metadata.gz: f31a83c8a492affe4ec34f2f6e43fa20a44f98dae6c0855322d5c12bc924edc154f7fa4793ca28f299f5436c8981651a5777c2e7944db6dee9c3b99f5ec997ce
+  data.tar.gz: 54c1beecbcb673274bdf98169f11fa51bc746282418c3b9ca30385194ae384e3532b04bbca3ef196316c11d3344258b9059da1ea18630d8d90b51029c608565d

data/.gitignore CHANGED

@@ -1,2 +1,3 @@
 Gemfile.lock
 /pkg
+/spec/data/emoji-test.txt

data/CHANGELOG.md CHANGED

@@ -1,5 +1,15 @@
 # CHANGELOG
+### 3.8.0
+- Add new RGI-based regexes `REGEX_INCLUDE_MQE` and `REGEX_INCLUDE_MQE_UQE`, which allow matching
+  minimally-qualified and unqualified RGI sequences (Emoji that lack some VS16)
+- Add specs running through `emoji-test.txt` and classify qualification statuses per regex
+- Improve documentation and add detailed table about which regex has which features
+- Native regexes: Use native Emoji props for Emoji text presentation
+- Update CLDR to v46 (valid subdivisions)
+- Further improvements (see commit log)
 ### 3.7.0
 - Bump required Ruby slightly to 2.5
@@ -29,7 +39,7 @@
 ### 3.3.2
 - Update valid subdivisions to CLDR 43 (no changes)
-  -> there won't be any new subdivision flags in Emoji
+  -> there won't be any new RGI subdivision flags in Emoji
 ### 3.3.1

data/README.md CHANGED

@@ -1,15 +1,15 @@
 # Unicode::Emoji [![[version]](https://badge.fury.io/rb/unicode-emoji.svg)](https://badge.fury.io/rb/unicode-emoji)  [![[ci]](https://github.com/janlelis/unicode-emoji/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-emoji/actions?query=workflow%3ATest)
-Provides regular expressions to find Emoji in strings, incorporating the latest Unicode and Emoji standards.
+Provides regular expressions to find Emoji in strings, incorporating the latest Unicode / Emoji standards.
 Additional features:
-- A categorized list of recommended Emoji
+- A categorized list of Emoji (RGI: Recommended for General Interchange)
 - Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)
 Emoji version: **16.0** (September 2024)
-CLDR version (used for sub-region flags): **45** (April 2024)
+CLDR version (used for sub-region flags): **46** (October 2024)
 ## Gemfile
@@ -17,16 +17,14 @@ CLDR version (used for sub-region flags): **45** (April 2024)
 gem "unicode-emoji"
 ```
-## Usage
-### Regex
+## Usage – Regex Matching
 The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
 ```ruby
 require "unicode/emoji"
-string = "String which contains all kinds of emoji:
+string = "String which contains all types of Emoji sequences:
 - Singleton Emoji: 😴
 - Textual singleton Emoji with Emoji variation: ▶️
@@ -35,64 +33,114 @@ string = "String which contains all kinds of emoji:
 - Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
 - Keycap sequence: 2️⃣
 - Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️
 "
 string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]
 ```
-#### Main Regexes
+Depending on your exact use case, you can choose between multiple levels of Emoji detection:
-There are multiple levels of Emoji detection:
+### Main Regexes
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX`       | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
-`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `1`
-`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`
-`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` |
-##### Picking the Right Emoji Regex
-- Usually you just want `REGEX` (RGI set)
-- If you want broader matching (any ZJW sequences, more sub-region flags), choose `REGEX_VALID`
-- Even brolader is `REGEX_WELL_FORMED`, which will also match any region flag and any tag sequence
-- And then there is `REGEX_POSSIBLE` , which is a quick check for possible Emoji, which might contain false positives, [suggested in the Unicode Standard](https://www.unicode.org/reports/tr51/#EBNF_and_Regex)
-Property | Escaped | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE`
----------|---------|-----------------------------|-----------------------|-----------------------------------|-----------------
-Region "🇵🇹" | `\u{1F1F5 1F1F9}` | Yes | Yes | Yes | Yes
-Region "🇵🇵" | `\u{1F1F5 1F1F5}` | No  | No  | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | Yes | Yes | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | No  | Yes | Yes | Yes
-Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿" | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | No  | No  | Yes | Yes
-ZWJ Sequence "🤾🏽‍♀️" | `\u{1F93E 1F3FD 200D 2640 FE0F}` | Yes | Yes | Yes | Yes
-ZWJ Sequence "🤠‍🤢" | `\u{1F920 200D 1F922}` | No  | Yes | Yes | Yes
+`Unicode::Emoji::REGEX`       | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences (RGI/FQE) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` |  `🤾🏽‍♀`, `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` ,`🏌‍♂️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`,`🏌‍♂️` , `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` | `1⃣`
-Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+#### Include Text Emoji
-More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
+By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
-#### Singleton Regexes
+Regex                         | Description | Example Matches | Example Non-Matches
+------------------------------|-------------|-----------------|--------------------
+`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶`, `1⃣` | `🤾🏽‍♀`, `🏌‍♂️`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `😴︎`, `▶`, `1⃣` | `🏻`, `🇵🇵`, `1`
+`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `1⃣` | `🏻`, `1`
-Matches only simple one-codepoint (+ optional variation selector) Emoji:
+#### Minimally-qualified and Unqualified Sequences
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`
-`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_INCLUDE_MQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` | `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_INCLUDE_MQE_UQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
-#### Include Textual Emoji
+[List of MQE and UQE Emoji sequences](https://character.construction/unqualified-emoji)
-By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
+#### Singleton Regexes
+Matches only simple one-codepoint (+ optional variation selector) Emoji:
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶` | `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `😴︎`, `▶` | `🏻`, `🇵🇵`
-`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶` | `🏻`
+`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+Here is a list of all Emoji that can be matched using the two regexes: [character.construction/emoji-vs-text](https://character.construction/emoji-vs-text)
+While `REGEX_BASIC` is part of the above regexes, `REGEX_TEXT` is only included in the `*_INCLUDE_TEXT` or `*_UQE` variants.
+### Comparison
+1) Fully-qualified RGI Emoji ZWJ sequence
+2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
+3) Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12).
+4) Non-RGI Emoji ZWJ sequence
+5) Valid Region made from a pair of Regional Indicators
+6) Any Region made from a pair of Regional Indicators
+7) RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
+8) Valid Flag Emoji Tag Sequences (any known subdivision)
+9) Any Emoji Tag Sequences (any tag sequence with any base)
+10) Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
+11) Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
+12) Non-Emoji (unqualified) keycap
+Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
+-|-|-|-|-|-|-|-|-|-|-|-|-
+REGEX                          | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE TEXT             | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX INCLUDE MQE              | ✅ | ✅ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE MQE UQE          | ✅ | ✅ | ✅    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX VALID                    | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
+REGEX VALID INCLUDE TEXT       | ✅ | ✅ | ✅    | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
+REGEX WELL FORMED              | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
+REGEX WELL FORMED INCLUDE TEXT | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
+REGEX POSSIBLE                 | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
+REGEX BASIC                    | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX TEXT                     | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
+¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
+See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.
+### Picking the Right Emoji Regex
+- Usually you just want `REGEX` (recommended Emoji set, RGI)
+- Use `REGEX_INCLUDE_MQE` or `REGEX_INCLUDE_MQE_UQE` if you want to catch Emoji sequences with missing Variation Selectors.
+- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID`
+- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED`
+- Use the `_INCLUDE_TEXT` suffix with any of the above base regexes, if you want to also match basic textual Emoji
+- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji, comparable to `REGEX_WELL_FORMED*`. It might produce false positives; however, the regex is less complex and is [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.
+### Examples
+Desc | Emoji | Escaped | `REGEX` (RGI/FQE) | `REGEX_INCLUDE_MQE` (RGI/MQE) | `REGEX_VALID` | `REGEX_WELL_FORMED` / `REGEX_POSSIBLE`
+-----|-------|---------|---------------|-----------------------|-----------------------------------|-----------------
+RGI ZWJ Sequence   | 🤾🏽‍♀️ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | ✅ | ✅ | ✅ | ✅
+RGI ZWJ Sequence MQE | 🤾🏽‍♀ | `\u{1F93E 1F3FD 200D 2640}` | ❌ | ✅ | ✅ | ✅
+Valid ZWJ Sequence, Non-RGI | 🤠‍🤢 | `\u{1F920 200D 1F922}` | ❌ | ❌  | ✅ | ✅
+Known Region       | 🇵🇹 | `\u{1F1F5 1F1F9}` | ✅ | ✅ | ✅ | ✅
+Unknown Region     | 🇵🇵 | `\u{1F1F5 1F1F5}` | ❌ | ❌  | ❌  | ✅
+RGI Tag Sequence   | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | ✅ | ✅ | ✅ | ✅
+Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | ❌ | ❌  | ✅ | ✅
+Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | ❌ | ❌  | ❌  | ✅
+Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
-#### Extended Pictographic Regex
+### Extended Pictographic Regex
 `Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.
@@ -100,18 +148,13 @@ Regex                         | Description | Example Matches | Example Non-Matc
 See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.
-#### Partial Regexes
-Matches potential Emoji parts (often, this is not what you want):
-Regex                         | Description | Example Matches | Example Non-Matches
-------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_ANY`   | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please not that this will match Emoji-parts rather than complete Emoji, for example, single digits! | `😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢` | -
+### Partial Regexes
+`Unicode::Emoji::REGEX_ANY`, same as `\p{Emoji}`. Deprecated: Will be removed or renamed in the future.
-### List
+## Usage – List
-Use `Unicode::Emoji::LIST` or the list method to get a grouped (and ordered) list of Emoji:
+Use `Unicode::Emoji::LIST` or the **list** method to get an ordered and categorized list of Emoji:
 ```ruby
 Unicode::Emoji.list.keys
@@ -124,13 +167,13 @@ Unicode::Emoji.list("Food & Drink", "food-asian")
 => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
 ```
-Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the `#list` method.
+Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
 A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).
-### Properties
+## Usage – Properties Data
-Allows you to access the codepoint data form Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
+Allows you to access the codepoint data for a single character from Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
 ```ruby
 require "unicode/emoji"

data/Rakefile CHANGED

@@ -28,14 +28,18 @@ task :irb do
 end
 # # #
-# Run Specs
+# Run specs
 desc "#{gemspec.name} | Spec"
 task :spec do
-  ruby "spec/unicode_emoji_spec.rb"
+  ruby File.join("spec", "*_spec.rb")
 end
 task default: :spec
+# # #
+# Generate regex
 desc "#{gemspec.name} | Generates all regex constants and saves them to lib/unicode/emoji/{generated,generated_native} directories"
 task :generate_constants do
   load "data/generate_constants.rb", true

data/data/emoji.marshal.gz CHANGED

Binary file

data/data/generate_constants.rb CHANGED

@@ -68,10 +68,10 @@ def pack_and_join(ords)
   end
 end
-def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, picto:, picto_no_emoji:)
+def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, text_presentation:, picto:, picto_no_emoji:)
   emoji_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION) + pack(EMOJI_VARIATION_SELECTOR),
+      text_presentation + pack(EMOJI_VARIATION_SELECTOR),
       emoji_presentation + "(?!" + pack(TEXT_VARIATION_SELECTOR) + ")" + pack(EMOJI_VARIATION_SELECTOR) + "?",
     )
@@ -79,14 +79,20 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     "(?!" + emoji_component + ")" + emoji_presentation_sequence
   text_keycap_sequence = \
-    join(EMOJI_KEYCAPS.map{|keycap| pack([keycap, EMOJI_KEYCAP_SUFFIX]) })
+    pack_and_join(EMOJI_KEYCAPS) + pack(EMOJI_KEYCAP_SUFFIX)
   text_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION)+ "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
+      text_presentation + "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
       emoji_presentation + pack(TEXT_VARIATION_SELECTOR),
     )
+  text_emoji = \
+    join(
+      "(?!" + emoji_component + ")" + text_presentation_sequence,
+      text_keycap_sequence,
+    )
   emoji_modifier_sequence = \
     emoji_modifier_base + emoji_modifier
@@ -99,22 +105,11 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   emoji_well_formed_flag_sequence = \
     '\p{RI}{2}'
-  emoji_valid_core_sequence = \
-    join(
-      # emoji_character,
-      emoji_keycap_sequence,
-      emoji_modifier_sequence,
-      non_component_emoji_presentation_sequence,
-      emoji_valid_flag_sequence,
-    )
-  emoji_well_formed_core_sequence = \
+  emoji_core_sequence = \
     join(
-      # emoji_character,
       emoji_keycap_sequence,
       emoji_modifier_sequence,
       non_component_emoji_presentation_sequence,
-      emoji_well_formed_flag_sequence,
     )
   # Sort to make sure complex sequences match first
@@ -144,6 +139,18 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   emoji_rgi_zwj_sequence = \
     pack_and_join(RECOMMENDED_ZWJ_SEQUENCES.sort_by(&:length).reverse)
+  # FQE+MQE: Make VS16 optional after ZWJ has appeared
+  emoji_rgi_include_mqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      /#{ pack(ZWJ) }[^|]+?\K#{ pack(EMOJI_VARIATION_SELECTOR) }/,
+      pack(EMOJI_VARIATION_SELECTOR) + "?"
+    )
+  # FQE+MQE+UQE: Make all VS16 optional
+  emoji_rgi_include_mqe_uqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      pack(EMOJI_VARIATION_SELECTOR),
+      pack(EMOJI_VARIATION_SELECTOR) + "?",
+    )
   emoji_valid_zwj_element = \
     join(
       emoji_modifier_sequence,
@@ -160,21 +167,68 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     join(
       emoji_rgi_zwj_sequence,
       emoji_rgi_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_sequence_include_text = \
+    join(
+      emoji_rgi_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
+    )
+  emoji_rgi_include_mqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_include_mqe_uqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_uqe_zwj_sequence,
+      text_emoji, # also uqe
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
     )
   emoji_valid_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_valid_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_valid_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_valid_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
     )
   emoji_well_formed_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_well_formed_tag_sequence,
-      emoji_well_formed_core_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_well_formed_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_well_formed_tag_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
     )
   emoji_possible_modification = \
@@ -198,41 +252,42 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
   regexes[:REGEX] = Regexp.compile(emoji_rgi_sequence)
+  # rgi + singleton text
+  regexes[:REGEX_INCLUDE_TEXT] = Regexp.compile(emoji_rgi_sequence_include_text)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional if not at first emoji character
+  regexes[:REGEX_INCLUDE_MQE] = Regexp.compile(emoji_rgi_include_mqe_sequence)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional even at first emoji character
+  regexes[:REGEX_INCLUDE_MQE_UQE] = Regexp.compile(emoji_rgi_include_mqe_uqe_sequence)
   # Matches basic singleton emoji and all kind of valid sequences
   regexes[:REGEX_VALID] = Regexp.compile(emoji_valid_sequence)
+  # valid + singleton text
+  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.compile(emoji_valid_sequence_include_text)
   # Matches basic singleton emoji and all kind of sequences
   regexes[:REGEX_WELL_FORMED] = Regexp.compile(emoji_well_formed_sequence)
+  # well-formed + singleton text
+  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.compile(emoji_well_formed_sequence_include_text)
   # Quick test which might lead to false positives
   # See https://www.unicode.org/reports/tr51/#EBNF_and_Regex
   regexes[:REGEX_POSSIBLE] = Regexp.compile(emoji_possible)
-  # Matches only basic single, non-textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_BASIC] = Regexp.compile(
-    "(?!" + emoji_component + ")" + emoji_presentation_sequence
-  )
+  # Matches only basic single, non-textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_BASIC] = Regexp.compile(non_component_emoji_presentation_sequence)
-  # Matches only basic single, textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_TEXT] = Regexp.compile(
-    join(
-      "(?!" + emoji_component + ")" + text_presentation_sequence,
-      text_keycap_sequence,
-    )
-  )
+  # Matches only basic single, textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_TEXT] = Regexp.compile(text_emoji)
-  # Matches any emoji-related codepoint - Use with caution (returns partial matches)
+  # Same as \p{Emoji} - to be removed or renamed
   regexes[:REGEX_ANY] = Regexp.compile(emoji_character)
-  # Combined REGEXes which also match for TEXTUAL emoji
-  regexes[:REGEX_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX], regexes[:REGEX_TEXT])
-  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_VALID], regexes[:REGEX_TEXT])
-  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_WELL_FORMED], regexes[:REGEX_TEXT])
   regexes[:REGEX_PICTO] = Regexp.compile(picto)
   regexes[:REGEX_PICTO_NO_EMOJI] = Regexp.compile(picto_no_emoji)
@@ -246,6 +301,7 @@ regexes = compile(
   emoji_modifier_base:  pack_and_join(EMOJI_MODIFIER_BASES),
   emoji_component:      pack_and_join(EMOJI_COMPONENT),
   emoji_presentation:   pack_and_join(EMOJI_PRESENTATION),
+  text_presentation:    pack_and_join(TEXT_PRESENTATION),
   picto:                pack_and_join(EXTENDED_PICTOGRAPHIC),
   picto_no_emoji:       pack_and_join(EXTENDED_PICTOGRAPHIC_NO_EMOJI)
 )
@@ -257,6 +313,7 @@ native_regexes = compile(
   emoji_modifier_base:  "\\p{EBase}",
   emoji_component:      "\\p{EComp}",
   emoji_presentation:   "\\p{EPres}",
+  text_presentation:    "\\p{Emoji}(?<!\\p{EPres})",
   picto:                "\\p{ExtPict}",
   picto_no_emoji:       "\\p{ExtPict}(?<!\\p{Emoji})"
 )
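
The two `gsub` calls that derive the MQE and UQE variants from the RGI ZWJ alternation can be tried in isolation. The following is a minimal sketch, not the generator itself. It assumes the alternatives are literal characters joined by `|` (the real `pack` / `pack_and_join` helpers are not fully shown in this diff and may emit escaped codepoints instead). The two-entry `rgi` list, the `full_match?` helper, and the local `ZWJ` / `VS16` string constants are made up for illustration.

```ruby
# Sketch only: literal characters stand in for whatever pack() emits in the generator.
ZWJ  = "\u200D" # zero width joiner
VS16 = "\uFE0F" # emoji variation selector

# Hypothetical mini data set: two RGI ZWJ sequences as regex alternatives
rgi = [
  "\u{1F93E 1F3FD 200D 2640 FE0F}", # woman playing handball, medium skin tone
  "\u{1F3CC FE0F 200D 2642 FE0F}",  # man golfing
].join("|")

# FQE+MQE: a VS16 becomes optional only once a ZWJ has appeared in its alternative.
# \K keeps everything matched before it out of the replaced span.
include_mqe = rgi.gsub(/#{ZWJ}[^|]+?\K#{VS16}/, VS16 + "?")

# FQE+MQE+UQE: every VS16 becomes optional
include_mqe_uqe = rgi.gsub(VS16, VS16 + "?")

# Hypothetical helper: does the whole string match one of the alternatives?
def full_match?(source, string)
  Regexp.new("\\A(?:#{source})\\z").match?(string)
end

fqe = "\u{1F93E 1F3FD 200D 2640 FE0F}" # fully-qualified
mqe = "\u{1F93E 1F3FD 200D 2640}"      # VS16 missing after the ZWJ
uqe = "\u{1F3CC 200D 2642 FE0F}"       # VS16 missing on the first character

p full_match?(rgi, mqe)             # prints false
p full_match?(include_mqe, mqe)     # prints true
p full_match?(include_mqe, uqe)     # prints false
p full_match?(include_mqe_uqe, uqe) # prints true
```

Because the first pattern requires a ZWJ before the VS16 it rewrites, the VS16 directly after the first character stays mandatory, which is what separates minimally-qualified from unqualified sequences in this sketch.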

data/lib/unicode/emoji/constants.rb CHANGED

@@ -2,12 +2,13 @@
 module Unicode
   module Emoji
-    VERSION = "3.7.0"
+    VERSION = "3.8.0"
     EMOJI_VERSION = "16.0"
     CLDR_VERSION = "45"
     DATA_DIRECTORY = File.expand_path('../../../data', __dir__).freeze
     INDEX_FILENAME = (DATA_DIRECTORY + "/emoji.marshal.gz").freeze
+    # Unicode properties, see https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt
     PROPERTY_NAMES = {
       E: "Emoji",
       B: "Emoji_Modifier_Base",
@@ -17,13 +18,28 @@ module Unicode
       X: "Extended_Pictographic",
     }.freeze
+    # Variation Selector 16 (VS16), enables emoji presentation mode for preceding codepoint
     EMOJI_VARIATION_SELECTOR      = 0xFE0F
+    # Variation Selector 15 (VS15), enables text presentation mode for preceding codepoint
     TEXT_VARIATION_SELECTOR       = 0xFE0E
+    # First codepoint of tag-based subdivision flags
     EMOJI_TAG_BASE_FLAG           = 0x1F3F4
+    # Last codepoint of tag-based subdivision flags
     CANCEL_TAG                    = 0xE007F
+    # Tag characters allowed in tag-based subdivision flags
     SPEC_TAGS                     = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
+    # Combining Enclosing Keycap character
     EMOJI_KEYCAP_SUFFIX           = 0x20E3
+    # Zero-width-joiner to enable combination of multiple Emoji in a sequence
     ZWJ                           = 0x200D
+    # Two regional indicators make up a region
     REGIONAL_INDICATORS           = [*0x1F1E6..0x1F1FF].freeze
   end
 end
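
How these constants fit together can be illustrated by assembling a few sequences by hand. This is a minimal sketch that assumes only the codepoint values listed above, redefined locally so it runs without the gem. `subdivision_flag` and `with_presentation` are hypothetical helpers and not part of the gem's API.

```ruby
# Local copies of the codepoint values from constants.rb, so the sketch runs standalone
EMOJI_VARIATION_SELECTOR = 0xFE0F  # VS16
TEXT_VARIATION_SELECTOR  = 0xFE0E  # VS15
EMOJI_TAG_BASE_FLAG      = 0x1F3F4
CANCEL_TAG               = 0xE007F
SPEC_TAGS                = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
EMOJI_KEYCAP_SUFFIX      = 0x20E3

# Hypothetical helper: builds a tag-based flag from a subdivision id such as "gbsct".
# Each ASCII character maps to the tag character at 0xE0000 + its codepoint.
def subdivision_flag(id)
  tags = id.codepoints.map { |cp| 0xE0000 + cp }
  raise ArgumentError, "unsupported character in #{id.inspect}" unless (tags - SPEC_TAGS).empty?
  [EMOJI_TAG_BASE_FLAG, *tags, CANCEL_TAG].pack("U*")
end

# Hypothetical helper: requests emoji or text presentation for a single codepoint
def with_presentation(codepoint, emoji: true)
  selector = emoji ? EMOJI_VARIATION_SELECTOR : TEXT_VARIATION_SELECTOR
  [codepoint, selector].pack("U*")
end

scotland   = subdivision_flag("gbsct")
play_emoji = with_presentation(0x25B6)               # U+25B6 followed by VS16
play_text  = with_presentation(0x25B6, emoji: false) # U+25B6 followed by VS15
keycap_two = ["2".ord, EMOJI_VARIATION_SELECTOR, EMOJI_KEYCAP_SUFFIX].pack("U*")

p scotland.codepoints.map { |cp| cp.to_s(16).upcase }
# prints ["1F3F4", "E0067", "E0062", "E0073", "E0063", "E0074", "E007F"]
```

The printed codepoints correspond to the escaped form of the Scotland flag shown in the README examples table.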