RubyGems - unicode-emoji - Versions diffs - 3.7.0 → 3.8.0 - Mend

unicode-emoji 3.7.0 → 3.8.0

Files changed (35)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/CHANGELOG.md +11 -1
data/README.md +98 -55
data/Rakefile +6 -2
data/data/emoji.marshal.gz +0 -0
data/data/generate_constants.rb +97 -40
data/lib/unicode/emoji/constants.rb +17 -1
data/lib/unicode/emoji/generated/regex.rb +1 -1
data/lib/unicode/emoji/generated/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_basic.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/lazy_constants.rb +36 -0
data/lib/unicode/emoji/list.rb +3 -0
data/lib/unicode/emoji.rb +33 -6
data/spec/data/.keep +0 -0
data/spec/data/emoji-test.txt +5331 -0
data/spec/emoji_test_txt_spec.rb +181 -0
data/spec/unicode_emoji_spec.rb +36 -4
metadata +12 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1e5b069f3f3d2de97f31a0aa83568454b751330b8e86392d28baa7906a6d6302
-  data.tar.gz: afaed5c343b5cdb5f235bf1e40fd78d221b0efa3097517f015db3c5a58e96775
+  metadata.gz: e3c7cc2671d256d8208b72d719384e7c13aaace4fec6b4919b92640e5336d87f
+  data.tar.gz: 9420777da4805787467c7f4eac1580f7179a6abf56e54b011c11640a50502b88
 SHA512:
-  metadata.gz: e9c6726eee2f48cd0c51937bb0fcd6f0009b7fe0eb9e973bef713ac063d668c4d520f490a46bdf944892dcac2854b17ca72779593257f3edc8f0dfd445f3a729
-  data.tar.gz: 7c1747073a4ec43ea4ea24ff304eb2cf3deb58918f2757cd4b4daee922b15d5478bd979585b60984366d21787b3d5183bcd5fc8cd3cdea6aa5ceedb9860e19d6
+  metadata.gz: f31a83c8a492affe4ec34f2f6e43fa20a44f98dae6c0855322d5c12bc924edc154f7fa4793ca28f299f5436c8981651a5777c2e7944db6dee9c3b99f5ec997ce
+  data.tar.gz: 54c1beecbcb673274bdf98169f11fa51bc746282418c3b9ca30385194ae384e3532b04bbca3ef196316c11d3344258b9059da1ea18630d8d90b51029c608565d

data/.gitignore CHANGED

@@ -1,2 +1,3 @@
 Gemfile.lock
 /pkg
+/spec/data/emoji-test.txt

data/CHANGELOG.md CHANGED

@@ -1,5 +1,15 @@
 # CHANGELOG
+### 3.8.0
+- Add new RGI-based regexes `REGEX_INCLUDE_MQE` and `REGEX_INCLUDE_MQE_UQE`, which allow matching
+  minimally-qualified and unqualified RGI sequences (Emoji that lack some VS16)
+- Add specs that run through `emoji-test.txt` and classify qualification statuses per regex
+- Improve documentation and add detailed table about which regex has which features
+- Native regexes: Use native Emoji props for Emoji text presentation
+- Update CLDR to v46 (valid subdivisions)
+- Further improvements (see commit log)
 ### 3.7.0
 - Bump required Ruby slightly to 2.5
@@ -29,7 +39,7 @@
 ### 3.3.2
 - Update valid subdivisions to CLDR 43 (no changes)
-  -> there won't be any new subdivision flags in Emoji
+  -> there won't be any new RGI subdivision flags in Emoji
 ### 3.3.1
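
The qualification levels named in the 3.8.0 changelog entry can be illustrated with a minimal sketch. It does not use the gem's generated regexes: the three patterns below are hand-built for a single sequence ("man golfing"), and all constant and variable names are assumptions made for illustration only.

```ruby
# Illustrative only: hand-written patterns for ONE ZWJ sequence,
# not the regexes shipped with unicode-emoji.
VS16 = "\u{FE0F}" # emoji presentation selector
ZWJ  = "\u{200D}" # zero width joiner

samples = {
  # golfer + VS16 + ZWJ + male sign + VS16
  fully_qualified:     "\u{1F3CC}#{VS16}#{ZWJ}\u{2642}#{VS16}",
  # VS16 is missing, but not on the first character
  minimally_qualified: "\u{1F3CC}#{VS16}#{ZWJ}\u{2642}",
  # VS16 is missing on the first character
  unqualified:         "\u{1F3CC}#{ZWJ}\u{2642}#{VS16}",
}

patterns = {
  fqe_only:        /\A\u{1F3CC}\u{FE0F}\u{200D}\u{2642}\u{FE0F}\z/,
  include_mqe:     /\A\u{1F3CC}\u{FE0F}\u{200D}\u{2642}\u{FE0F}?\z/,
  include_mqe_uqe: /\A\u{1F3CC}\u{FE0F}?\u{200D}\u{2642}\u{FE0F}?\z/,
}

# For every sample, record which of the three patterns accept it
results = samples.transform_values do |string|
  patterns.select { |_name, regex| regex.match?(string) }.keys
end

results.each { |status, matched| puts "#{status}: #{matched.join(', ')}" }
```

Each pattern is a superset of the previous one, which mirrors how the changelog describes `REGEX_INCLUDE_MQE` and `REGEX_INCLUDE_MQE_UQE` relative to `REGEX`.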

data/README.md CHANGED

@@ -1,15 +1,15 @@
 # Unicode::Emoji [![[version]](https://badge.fury.io/rb/unicode-emoji.svg)](https://badge.fury.io/rb/unicode-emoji)  [![[ci]](https://github.com/janlelis/unicode-emoji/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-emoji/actions?query=workflow%3ATest)
-Provides regular expressions to find Emoji in strings, incorporating the latest Unicode and Emoji standards.
+Provides regular expressions to find Emoji in strings, incorporating the latest Unicode / Emoji standards.
 Additional features:
-- A categorized list of recommended Emoji
+- A categorized list of Emoji (RGI: Recommended for General Interchange)
 - Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)
 Emoji version: **16.0** (September 2024)
-CLDR version (used for sub-region flags): **45** (April 2024)
+CLDR version (used for sub-region flags): **46** (October 2024)
 ## Gemfile
@@ -17,16 +17,14 @@ CLDR version (used for sub-region flags): **45** (April 2024)
 gem "unicode-emoji"
 ```
-## Usage
-### Regex
+## Usage – Regex Matching
 The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
 ```ruby
 require "unicode/emoji"
-string = "String which contains all kinds of emoji:
+string = "String which contains all types of Emoji sequences:
 - Singleton Emoji: 😴
 - Textual singleton Emoji with Emoji variation: ▶️
@@ -35,64 +33,114 @@ string = "String which contains all kinds of emoji:
 - Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
 - Keycap sequence: 2️⃣
 - Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️
 "
 string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]
 ```
-#### Main Regexes
+Depending on your exact use case, you can choose between multiple levels of Emoji detection:
-There are multiple levels of Emoji detection:
+### Main Regexes
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX`       | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
-`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `1`
-`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`
-`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` |
-##### Picking the Right Emoji Regex
-- Usually you just want `REGEX` (RGI set)
-- If you want broader matching (any ZJW sequences, more sub-region flags), choose `REGEX_VALID`
-- Even brolader is `REGEX_WELL_FORMED`, which will also match any region flag and any tag sequence
-- And then there is `REGEX_POSSIBLE` , which is a quick check for possible Emoji, which might contain false positives, [suggested in the Unicode Standard](https://www.unicode.org/reports/tr51/#EBNF_and_Regex)
-Property | Escaped | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE`
----------|---------|-----------------------------|-----------------------|-----------------------------------|-----------------
-Region "🇵🇹" | `\u{1F1F5 1F1F9}` | Yes | Yes | Yes | Yes
-Region "🇵🇵" | `\u{1F1F5 1F1F5}` | No  | No  | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿" | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | Yes | Yes | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿" | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | No  | Yes | Yes | Yes
-Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿" | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | No  | No  | Yes | Yes
-ZWJ Sequence "🤾🏽‍♀️" | `\u{1F93E 1F3FD 200D 2640 FE0F}` | Yes | Yes | Yes | Yes
-ZWJ Sequence "🤠‍🤢" | `\u{1F920 200D 1F922}` | No  | Yes | Yes | Yes
+`Unicode::Emoji::REGEX`       | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences (RGI/FQE) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` |  `🤾🏽‍♀`, `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` ,`🏌‍♂️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`,`🏌‍♂️` , `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` | `1⃣`
-Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+#### Include Text Emoji
-More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
+By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
-#### Singleton Regexes
+Regex                         | Description | Example Matches | Example Non-Matches
+------------------------------|-------------|-----------------|--------------------
+`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶`, `1⃣` | `🤾🏽‍♀`, `🏌‍♂️`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `😴︎`, `▶`, `1⃣` | `🏻`, `🇵🇵`, `1`
+`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `1⃣` | `🏻`, `1`
-Matches only simple one-codepoint (+ optional variation selector) Emoji:
+#### Minimally-qualified and Unqualified Sequences
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`
-`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_INCLUDE_MQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` | `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_INCLUDE_MQE_UQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
-#### Include Textual Emoji
+[List of MQE and UQE Emoji sequences](https://character.construction/unqualified-emoji)
-By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
+#### Singleton Regexes
+Matches only simple one-codepoint (+ optional variation selector) Emoji:
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶` | `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `😴︎`, `▶` | `🏻`, `🇵🇵`
-`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶` | `🏻`
+`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+Here is a list of all Emoji that can be matched using the two regexes: [character.construction/emoji-vs-text](https://character.construction/emoji-vs-text)
+While `REGEX_BASIC` is part of the above regexes, `REGEX_TEXT` is only included in the `*_INCLUDE_TEXT` or `*_UQE` variants.
+### Comparison
+1) Fully-qualified RGI Emoji ZWJ sequence
+2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
+3) Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12).
+4) Non-RGI Emoji ZWJ sequence
+5) Valid Region made from a pair of Regional Indicators
+6) Any Region made from a pair of Regional Indicators
+7) RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
+8) Valid Flag Emoji Tag Sequences (any known subdivision)
+9) Any Emoji Tag Sequences (any tag sequence with any base)
+10) Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
+11) Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
+12) Non-Emoji (unqualified) keycap
+Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
+-|-|-|-|-|-|-|-|-|-|-|-|-
+REGEX                          | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE TEXT             | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX INCLUDE MQE              | ✅ | ✅ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE MQE UQE          | ✅ | ✅ | ✅    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX VALID                    | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
+REGEX VALID INCLUDE TEXT       | ✅ | ✅ | ✅    | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
+REGEX WELL FORMED              | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
+REGEX WELL FORMED INCLUDE TEXT | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
+REGEX POSSIBLE                 | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
+REGEX BASIC                    | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX TEXT                     | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
+¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
+See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.
+### Picking the Right Emoji Regex
+- Usually you just want `REGEX` (recommended Emoji set, RGI)
+- Use `REGEX_INCLUDE_MQE` or `REGEX_INCLUDE_MQE_UQE` if you want to catch Emoji sequences with missing Variation Selectors.
+- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID`
+- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED`
+- Use the `_INCLUDE_TEXT` suffix with any of the above base regexes, if you want to also match basic textual Emoji
+- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji, comparable to `REGEX_WELL_FORMED*`. It might produce false positives; however, the regex is less complex and is [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.
+### Examples
+Desc | Emoji | Escaped | `REGEX` (RGI/FQE) | `REGEX_INCLUDE_MQE` (RGI/MQE) | `REGEX_VALID` | `REGEX_WELL_FORMED` / `REGEX_POSSIBLE`
+-----|-------|---------|---------------|-----------------------|-----------------------------------|-----------------
+RGI ZWJ Sequence   | 🤾🏽‍♀️ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | ✅ | ✅ | ✅ | ✅
+RGI ZWJ Sequence MQE | 🤾🏽‍♀ | `\u{1F93E 1F3FD 200D 2640}` | ❌ | ✅ | ✅ | ✅
+Valid ZWJ Sequence, Non-RGI | 🤠‍🤢 | `\u{1F920 200D 1F922}` | ❌ | ❌  | ✅ | ✅
+Known Region       | 🇵🇹 | `\u{1F1F5 1F1F9}` | ✅ | ✅ | ✅ | ✅
+Unknown Region     | 🇵🇵 | `\u{1F1F5 1F1F5}` | ❌ | ❌  | ❌  | ✅
+RGI Tag Sequence   | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | ✅ | ✅ | ✅ | ✅
+Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | ❌ | ❌  | ✅ | ✅
+Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | ❌ | ❌  | ❌  | ✅
+Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
-#### Extended Pictographic Regex
+### Extended Pictographic Regex
 `Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.
@@ -100,18 +148,13 @@ Regex                         | Description | Example Matches | Example Non-Matc
 See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.
-#### Partial Regexes
-Matches potential Emoji parts (often, this is not what you want):
-Regex                         | Description | Example Matches | Example Non-Matches
-------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_ANY`   | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please not that this will match Emoji-parts rather than complete Emoji, for example, single digits! | `😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢` | -
+### Partial Regexes
+`Unicode::Emoji::REGEX_ANY`, same as `\p{Emoji}`. Deprecated: Will be removed or renamed in the future.
-### List
+## Usage – List
-Use `Unicode::Emoji::LIST` or the list method to get a grouped (and ordered) list of Emoji:
+Use `Unicode::Emoji::LIST` or the **list** method to get an ordered and categorized list of Emoji:
 ```ruby
 Unicode::Emoji.list.keys
@@ -124,13 +167,13 @@ Unicode::Emoji.list("Food & Drink", "food-asian")
 => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
 ```
-Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the `#list` method.
+Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
 A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).
-### Properties
+## Usage – Properties Data
-Allows you to access the codepoint data form Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
+Allows you to access the codepoint data for a single character from Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
 ```ruby
 require "unicode/emoji"

data/Rakefile CHANGED

@@ -28,14 +28,18 @@ task :irb do
 end
 # # #
-# Run Specs
+# Run specs
 desc "#{gemspec.name} | Spec"
 task :spec do
-  ruby "spec/unicode_emoji_spec.rb"
+  ruby File.join("spec", "*_spec.rb")
 end
 task default: :spec
+# # #
+# Generate regex
 desc "#{gemspec.name} | Generates all regex constants and saves them to lib/unicode/emoji/{generated,generated_native} directories"
 task :generate_constants do
   load "data/generate_constants.rb", true

data/data/emoji.marshal.gz CHANGED

Binary file

data/data/generate_constants.rb CHANGED

@@ -68,10 +68,10 @@ def pack_and_join(ords)
   end
 end
-def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, picto:, picto_no_emoji:)
+def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, text_presentation:, picto:, picto_no_emoji:)
   emoji_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION) + pack(EMOJI_VARIATION_SELECTOR),
+      text_presentation + pack(EMOJI_VARIATION_SELECTOR),
       emoji_presentation + "(?!" + pack(TEXT_VARIATION_SELECTOR) + ")" + pack(EMOJI_VARIATION_SELECTOR) + "?",
     )
@@ -79,14 +79,20 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     "(?!" + emoji_component + ")" + emoji_presentation_sequence
   text_keycap_sequence = \
-    join(EMOJI_KEYCAPS.map{|keycap| pack([keycap, EMOJI_KEYCAP_SUFFIX]) })
+    pack_and_join(EMOJI_KEYCAPS) + pack(EMOJI_KEYCAP_SUFFIX)
   text_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION)+ "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
+      text_presentation + "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
       emoji_presentation + pack(TEXT_VARIATION_SELECTOR),
     )
+  text_emoji = \
+    join(
+      "(?!" + emoji_component + ")" + text_presentation_sequence,
+      text_keycap_sequence,
+    )
   emoji_modifier_sequence = \
     emoji_modifier_base + emoji_modifier
@@ -99,22 +105,11 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   emoji_well_formed_flag_sequence = \
     '\p{RI}{2}'
-  emoji_valid_core_sequence = \
-    join(
-      # emoji_character,
-      emoji_keycap_sequence,
-      emoji_modifier_sequence,
-      non_component_emoji_presentation_sequence,
-      emoji_valid_flag_sequence,
-    )
-  emoji_well_formed_core_sequence = \
+  emoji_core_sequence = \
     join(
-      # emoji_character,
       emoji_keycap_sequence,
       emoji_modifier_sequence,
       non_component_emoji_presentation_sequence,
-      emoji_well_formed_flag_sequence,
     )
   # Sort to make sure complex sequences match first
@@ -144,6 +139,18 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   emoji_rgi_zwj_sequence = \
     pack_and_join(RECOMMENDED_ZWJ_SEQUENCES.sort_by(&:length).reverse)
+  # FQE+MQE: Make VS16 optional after ZWJ has appeared
+  emoji_rgi_include_mqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      /#{ pack(ZWJ) }[^|]+?\K#{ pack(EMOJI_VARIATION_SELECTOR) }/,
+      pack(EMOJI_VARIATION_SELECTOR) + "?"
+    )
+  # FQE+MQE+UQE: Make all VS16 optional
+  emoji_rgi_include_mqe_uqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      pack(EMOJI_VARIATION_SELECTOR),
+      pack(EMOJI_VARIATION_SELECTOR) + "?",
+    )
   emoji_valid_zwj_element = \
     join(
       emoji_modifier_sequence,
@@ -160,21 +167,68 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     join(
       emoji_rgi_zwj_sequence,
       emoji_rgi_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_sequence_include_text = \
+    join(
+      emoji_rgi_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
+    )
+  emoji_rgi_include_mqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_include_mqe_uqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_uqe_zwj_sequence,
+      text_emoji, # also uqe
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
     )
   emoji_valid_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_valid_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_valid_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_valid_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
     )
   emoji_well_formed_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_well_formed_tag_sequence,
-      emoji_well_formed_core_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_well_formed_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_well_formed_tag_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
     )
   emoji_possible_modification = \
@@ -198,41 +252,42 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
   regexes[:REGEX] = Regexp.compile(emoji_rgi_sequence)
+  # rgi + singleton text
+  regexes[:REGEX_INCLUDE_TEXT] = Regexp.compile(emoji_rgi_sequence_include_text)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional if not at first emoji character
+  regexes[:REGEX_INCLUDE_MQE] = Regexp.compile(emoji_rgi_include_mqe_sequence)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional even at first emoji character
+  regexes[:REGEX_INCLUDE_MQE_UQE] = Regexp.compile(emoji_rgi_include_mqe_uqe_sequence)
   # Matches basic singleton emoji and all kind of valid sequences
   regexes[:REGEX_VALID] = Regexp.compile(emoji_valid_sequence)
+  # valid + singleton text
+  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.compile(emoji_valid_sequence_include_text)
   # Matches basic singleton emoji and all kind of sequences
   regexes[:REGEX_WELL_FORMED] = Regexp.compile(emoji_well_formed_sequence)
+  # well-formed + singleton text
+  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.compile(emoji_well_formed_sequence_include_text)
   # Quick test which might lead to false positives
   # See https://www.unicode.org/reports/tr51/#EBNF_and_Regex
   regexes[:REGEX_POSSIBLE] = Regexp.compile(emoji_possible)
-  # Matches only basic single, non-textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_BASIC] = Regexp.compile(
-    "(?!" + emoji_component + ")" + emoji_presentation_sequence
-  )
+  # Matches only basic single, non-textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_BASIC] = Regexp.compile(non_component_emoji_presentation_sequence)
-  # Matches only basic single, textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_TEXT] = Regexp.compile(
-    join(
-      "(?!" + emoji_component + ")" + text_presentation_sequence,
-      text_keycap_sequence,
-    )
-  )
+  # Matches only basic single, textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_TEXT] = Regexp.compile(text_emoji)
-  # Matches any emoji-related codepoint - Use with caution (returns partial matches)
+  # Same as \p{Emoji} - to be removed or renamed
   regexes[:REGEX_ANY] = Regexp.compile(emoji_character)
-  # Combined REGEXes which also match for TEXTUAL emoji
-  regexes[:REGEX_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX], regexes[:REGEX_TEXT])
-  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_VALID], regexes[:REGEX_TEXT])
-  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_WELL_FORMED], regexes[:REGEX_TEXT])
   regexes[:REGEX_PICTO] = Regexp.compile(picto)
   regexes[:REGEX_PICTO_NO_EMOJI] = Regexp.compile(picto_no_emoji)
@@ -246,6 +301,7 @@ regexes = compile(
   emoji_modifier_base:  pack_and_join(EMOJI_MODIFIER_BASES),
   emoji_component:      pack_and_join(EMOJI_COMPONENT),
   emoji_presentation:   pack_and_join(EMOJI_PRESENTATION),
+  text_presentation:    pack_and_join(TEXT_PRESENTATION),
   picto:                pack_and_join(EXTENDED_PICTOGRAPHIC),
   picto_no_emoji:       pack_and_join(EXTENDED_PICTOGRAPHIC_NO_EMOJI)
 )
@@ -257,6 +313,7 @@ native_regexes = compile(
   emoji_modifier_base:  "\\p{EBase}",
   emoji_component:      "\\p{EComp}",
   emoji_presentation:   "\\p{EPres}",
+  text_presentation:    "\\p{Emoji}(?<!\\p{EPres})",
   picto:                "\\p{ExtPict}",
   picto_no_emoji:       "\\p{ExtPict}(?<!\\p{Emoji})"
 )
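
The two `gsub` calls added to `generate_constants.rb` rewrite a regex source string so that VS16 becomes optional. The following is a minimal sketch of that technique under stated assumptions: the alternation is built from literal characters (the real `pack` helper may emit a different representation), only two sample sequences are used, and the variable names are hypothetical.

```ruby
VS16 = "\u{FE0F}" # emoji presentation selector
ZWJ  = "\u{200D}" # zero width joiner

# Stand-in for the joined RGI ZWJ sequences: a "|"-separated regex source
alternation = [
  "\u{1F3CC}#{VS16}#{ZWJ}\u{2642}#{VS16}",  # man golfing
  "\u{1F441}#{VS16}#{ZWJ}\u{1F5E8}#{VS16}", # eye in speech bubble
].join("|")

# FQE+MQE: only a VS16 that follows a ZWJ within the same alternative becomes
# optional. \K discards everything matched so far, so gsub replaces just the VS16.
include_mqe_source = alternation.gsub(/#{ZWJ}[^|]+?\K#{VS16}/, "#{VS16}?")

# FQE+MQE+UQE: every VS16 becomes optional
include_mqe_uqe_source = alternation.gsub(VS16, "#{VS16}?")

include_mqe     = Regexp.new("\\A(?:#{include_mqe_source})\\z")
include_mqe_uqe = Regexp.new("\\A(?:#{include_mqe_uqe_source})\\z")

mqe_string = "\u{1F3CC}#{VS16}#{ZWJ}\u{2642}" # trailing VS16 missing
uqe_string = "\u{1F3CC}#{ZWJ}\u{2642}#{VS16}" # leading VS16 missing

puts include_mqe.match?(mqe_string)
puts include_mqe.match?(uqe_string)
puts include_mqe_uqe.match?(uqe_string)
```

Rewriting the source string keeps the alternation order intact, so longer sequences still appear before shorter ones after the substitution.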

data/lib/unicode/emoji/constants.rb CHANGED

@@ -2,12 +2,13 @@
 module Unicode
   module Emoji
-    VERSION = "3.7.0"
+    VERSION = "3.8.0"
     EMOJI_VERSION = "16.0"
     CLDR_VERSION = "45"
     DATA_DIRECTORY = File.expand_path('../../../data', __dir__).freeze
     INDEX_FILENAME = (DATA_DIRECTORY + "/emoji.marshal.gz").freeze
+    # Unicode properties, see https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt
     PROPERTY_NAMES = {
       E: "Emoji",
       B: "Emoji_Modifier_Base",
@@ -17,13 +18,28 @@ module Unicode
       X: "Extended_Pictographic",
     }.freeze
+    # Variation Selector 16 (VS16), enables emoji presentation mode for preceding codepoint
     EMOJI_VARIATION_SELECTOR      = 0xFE0F
+    # Variation Selector 15 (VS15), enables text presentation mode for preceding codepoint
     TEXT_VARIATION_SELECTOR       = 0xFE0E
+    # First codepoint of tag-based subdivision flags
     EMOJI_TAG_BASE_FLAG           = 0x1F3F4
+    # Last codepoint of tag-based subdivision flags
     CANCEL_TAG                    = 0xE007F
+    # Tag characters allowed in tag-based subdivision flags
     SPEC_TAGS                     = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
+    # Combining Enclosing Keycap character
     EMOJI_KEYCAP_SUFFIX           = 0x20E3
+    # Zero-width-joiner to enable combination of multiple Emoji in a sequence
     ZWJ                           = 0x200D
+    # Two regional indicators make up a region
     REGIONAL_INDICATORS           = [*0x1F1E6..0x1F1FF].freeze
   end
 end
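
To show how the newly commented codepoint constants relate to actual Emoji sequences, here is a small self-contained sketch. The constant values are copied from the diff above; the three helper methods (`keycap`, `region_flag`, `subdivision_flag`) are hypothetical and not part of the gem.

```ruby
# Values copied from constants.rb; redefined here so the sketch runs standalone
EMOJI_VARIATION_SELECTOR = 0xFE0F
EMOJI_TAG_BASE_FLAG      = 0x1F3F4
CANCEL_TAG               = 0xE007F
EMOJI_KEYCAP_SUFFIX      = 0x20E3
REGIONAL_INDICATORS      = [*0x1F1E6..0x1F1FF].freeze

# Hypothetical helper: digit + VS16 + combining enclosing keycap
def keycap(char)
  [char.ord, EMOJI_VARIATION_SELECTOR, EMOJI_KEYCAP_SUFFIX].pack("U*")
end

# Hypothetical helper: two regional indicators make up a region flag
def region_flag(code)
  code.upcase.each_char.map { |c| REGIONAL_INDICATORS[c.ord - "A".ord] }.pack("U*")
end

# Hypothetical helper: black flag + tag characters + cancel tag.
# A tag character is the ASCII codepoint shifted by 0xE0000.
def subdivision_flag(code)
  tags = code.downcase.each_char.map { |c| 0xE0000 + c.ord }
  [EMOJI_TAG_BASE_FLAG, *tags, CANCEL_TAG].pack("U*")
end

puts keycap("2")               # keycap sequence
puts region_flag("PT")         # region flag for Portugal
puts subdivision_flag("gbsct") # subdivision flag for Scotland
```

The resulting strings correspond to the escaped forms listed in the README examples table, such as `\u{1F1F5 1F1F9}` for the Portugal flag.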