RubyGems - unicode-emoji - Versions diffs - 3.6.0 → 3.8.0 - Mend

unicode-emoji 3.6.0 → 3.8.0

Files changed (38)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/CHANGELOG.md +24 -1
data/README.md +105 -63
data/Rakefile +6 -2
data/data/emoji.marshal.gz +0 -0
data/data/generate_constants.rb +120 -46
data/lib/unicode/emoji/constants.rb +18 -2
data/lib/unicode/emoji/generated/regex.rb +1 -1
data/lib/unicode/emoji/generated/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_possible.rb +8 -0
data/lib/unicode/emoji/generated/regex_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_basic.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_include_mqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_mqe_uqe.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_possible.rb +8 -0
data/lib/unicode/emoji/generated_native/regex_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_valid_include_text.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed.rb +1 -1
data/lib/unicode/emoji/generated_native/regex_well_formed_include_text.rb +1 -1
data/lib/unicode/emoji/lazy_constants.rb +37 -6
data/lib/unicode/emoji/list.rb +13 -0
data/lib/unicode/emoji.rb +38 -5
data/spec/data/.keep +0 -0
data/spec/data/emoji-test.txt +5331 -0
data/spec/emoji_test_txt_spec.rb +181 -0
data/spec/unicode_emoji_spec.rb +152 -5
data/unicode-emoji.gemspec +3 -3
metadata +20 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: cc9d3cd1548a2cc0861e05836cc4a5dcc0b4d4d631568c6bc89de094951fcb70
-  data.tar.gz: 2172d2a31051731d234aeae48ba8c6b9cba15181fb8b1886d3a90252048937ab
+  metadata.gz: e3c7cc2671d256d8208b72d719384e7c13aaace4fec6b4919b92640e5336d87f
+  data.tar.gz: 9420777da4805787467c7f4eac1580f7179a6abf56e54b011c11640a50502b88
 SHA512:
-  metadata.gz: ae1d2b65292537a89748d9ba1cc78c08723b37f6ccf5debb9fbeeb73720bd6e1c232f4f1be23d24699e3ebd2f1092646e2877b96366c0d4aa05e0f080aeadf99
-  data.tar.gz: 0d422eeccd45a9430e5df8a5bac8964dca3bd5616eb51d6f6999c3372f8697316e210b04e586ab3a734b5e74b3881a309a2c4a6b0e78964de7513550dc3380ec
+  metadata.gz: f31a83c8a492affe4ec34f2f6e43fa20a44f98dae6c0855322d5c12bc924edc154f7fa4793ca28f299f5436c8981651a5777c2e7944db6dee9c3b99f5ec997ce
+  data.tar.gz: 54c1beecbcb673274bdf98169f11fa51bc746282418c3b9ca30385194ae384e3532b04bbca3ef196316c11d3344258b9059da1ea18630d8d90b51029c608565d

data/.gitignore CHANGED

@@ -1,2 +1,3 @@
 Gemfile.lock
 /pkg
+/spec/data/emoji-test.txt

data/CHANGELOG.md CHANGED

@@ -1,5 +1,28 @@
 # CHANGELOG
+### 3.8.0
+- Add new RGI-based regexes `REGEX_INCLUDE_MQE` and `REGEX_INCLUDE_MQE_UQE`, which also match
+  minimally-qualified and unqualified RGI sequences (Emoji that lack some VS16)
+- Add specs that run through `emoji-test.txt` and classify qualification statuses per regex
+- Improve documentation and add detailed table about which regex has which features
+- Native regexes: Use native Emoji props for Emoji text presentation
+- Update CLDR to v46 (valid subdivisions)
+- Further improvements (see commit log)
+### 3.7.0
+- Bump required Ruby slightly to 2.5
+- Introduce new `REGEX_POSSIBLE` which contains the regex described in
+  https://www.unicode.org/reports/tr51/#EBNF_and_Regex
+- Fix that some valid subdivisions were not decompressed (`REGEX_VALID`)
+- Be stricter about selection of tag characters in `REGEX_WELL_FORMED`
+  - Only U+E0030..U+E0039, U+E0061..U+E007A allowed
+  - Max tag sequence length
+- Use native `/\p{RI}/` regex for regional indicators
+- Separately autoload emoji list, so it can be loaded when other indexes
+  are not needed
 ### 3.6.0
 - `Unicode::Emoji::REGEX_TEXT` now matches non-emoji keycaps like "3⃣"  (U+0033 U+20E3)
@@ -16,7 +39,7 @@
 ### 3.3.2
 - Update valid subdivisions to CLDR 43 (no changes)
-  -> there won't be any new subdivision flags in Emoji
+  -> there won't be any new RGI subdivision flags in Emoji
 ### 3.3.1
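The 3.7.0 entry above switches regional-indicator matching to the native `\p{RI}` property. The following is a minimal sketch of what that means in practice, assuming a Ruby whose regex engine knows `\p{RI}` (the changelog implies Ruby 2.5 or later). `EXPLICIT_REGION` is a hypothetical name for an explicit codepoint-list variant and is not taken from the gem.

```ruby
# Regional indicator codepoints, as listed in constants.rb (REGIONAL_INDICATORS)
REGIONAL_INDICATORS = [*0x1F1E6..0x1F1FF].freeze

# Hypothetical explicit variant: a character class built from the codepoint list
EXPLICIT_REGION = Regexp.new("[" + REGIONAL_INDICATORS.pack("U*") + "]{2}")

# Native variant, same pattern string as the diff uses: '\p{RI}{2}'
NATIVE_REGION = /\p{RI}{2}/

portugal = "\u{1F1F5 1F1F9}" # known region (PT)
unknown  = "\u{1F1F5 1F1F5}" # well-formed pair, but not a known region (PP)

[portugal, unknown].each do |flag|
  # Both variants accept any pair of regional indicators
  puts [flag.match?(EXPLICIT_REGION), flag.match?(NATIVE_REGION)].inspect
end
```

Both patterns accept any pair of regional indicators, so this building block only covers the well-formed level. Restricting matches to real regions still needs a list such as `VALID_REGION_FLAGS`.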

data/README.md CHANGED

@@ -1,18 +1,15 @@
 # Unicode::Emoji [![[version]](https://badge.fury.io/rb/unicode-emoji.svg)](https://badge.fury.io/rb/unicode-emoji)  [![[ci]](https://github.com/janlelis/unicode-emoji/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-emoji/actions?query=workflow%3ATest)
-Provides Unicode Emoji data and regexes, incorporating the latest Unicode and Emoji standards.
+Provides regular expressions to find Emoji in strings, incorporating the latest Unicode / Emoji standards.
-Also includes a categorized list of recommended Emoji.
+Additional features:
-Emoji version: **16.0** (September 2024)
-CLDR version (used for sub-region flags): **45** (April 2024)
-Supported Rubies: **3.x**
+- A categorized list of Emoji (RGI: Recommended for General Interchange)
+- Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)
-No longer supported Rubies, but might still work: **2.7**, **2.6**, **2.5**, **2.4**, **2.3**
+Emoji version: **16.0** (September 2024)
-If you are stuck on an older Ruby version, checkout the latest [0.9 version](https://rubygems.org/gems/unicode-emoji/versions/0.9.3) of this gem.
+CLDR version (used for sub-region flags): **46** (October 2024)
 ## Gemfile
@@ -20,16 +17,14 @@ If you are stuck on an older Ruby version, checkout the latest [0.9 version](htt
 gem "unicode-emoji"
 ```
-## Usage
-### Regex
+## Usage – Regex Matching
-The gem includes a bunch of Emoji regexes, which are compiled out of various Emoji Unicode data sources.
+The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
 ```ruby
 require "unicode/emoji"
-string = "String which contains all kinds of emoji:
+string = "String which contains all types of Emoji sequences:
 - Singleton Emoji: 😴
 - Textual singleton Emoji with Emoji variation: ▶️
@@ -38,62 +33,114 @@ string = "String which contains all kinds of emoji:
 - Sub-Region flag: 🏴󠁧󠁢󠁳󠁣󠁴󠁿
 - Keycap sequence: 2️⃣
 - Sequence using ZWJ (zero width joiner): 🤾🏽‍♀️
 "
 string.scan(Unicode::Emoji::REGEX) # => ["😴", "▶️", "🛌🏽", "🇵🇹", "🏴󠁧󠁢󠁳󠁣󠁴󠁿", "2️⃣", "🤾🏽‍♀️"]
 ```
-#### Main Regexes
+Depending on your exact use case, you can choose between multiple levels of Emoji detection:
-Matches (non-textual) Emoji of all kinds:
+### Main Regexes
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX`       | **Use this if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *recommended* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`
-`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kind of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`
+`Unicode::Emoji::REGEX`       | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences (RGI/FQE) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️` |  `🤾🏽‍♀`, `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` ,`🏌‍♂️`, `🤠‍🤢` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`,`🏌‍♂️` , `🤠‍🤢`,  `🇵🇵` | `😴︎`, `▶`, `🏻`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `🏻`, `1` | `1⃣`
-##### Picking the Right Emoji Regex
+#### Include Text Emoji
-- Usually you just want `REGEX` (RGI set)
-- If you want broader matching (e.g. more sub-regions), choose `REGEX_VALID`
-- If you even want to match for invalid sequences, too, use `REGEX_WELL_FORMED`
+By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
-Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for details.
-Property | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed)
----------|-----------------------------|-----------------------|----------------------------------
-Region "🇵🇹"                    | Yes | Yes | Yes
-Region "🇵🇵"                   | No  | No  | Yes
-Tag Sequence "🏴󠁧󠁢󠁳󠁣󠁴󠁿"              | Yes | Yes | Yes
-Tag Sequence "🏴󠁧󠁢󠁡󠁧󠁢󠁿"              | No  | Yes | Yes
-Tag Sequence "😴󠁧󠁢󠁡󠁡󠁡󠁿"              | No  | No  | Yes
-ZWJ Sequence "🤾🏽‍♀️"           | Yes | Yes | Yes
-ZWJ Sequence "🤠‍🤢"            | No  | Yes | Yes
-More info about valid vs. recommended Emoji in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
-#### Singleton Regexes
+Regex                         | Description | Example Matches | Example Non-Matches
+------------------------------|-------------|-----------------|--------------------
+`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶`, `1⃣` | `🤾🏽‍♀`, `🏌‍♂️`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `😴︎`, `▶`, `1⃣` | `🏻`, `🇵🇵`, `1`
+`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶`, `1⃣` | `🏻`, `1`
-Matches only simple one-codepoint (+ optional variation selector) Emoji:
+#### Minimally-qualified and Unqualified Sequences
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digit 1) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`
+`Unicode::Emoji::REGEX_INCLUDE_MQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀` | `🏌‍♂️`, `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
+`Unicode::Emoji::REGEX_INCLUDE_MQE_UQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️` | `😴︎`, `▶`, `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`, `1`, `1⃣`
-#### Include Textual Emoji
+[List of MQE and UQE Emoji sequences](https://character.construction/unqualified-emoji)
-By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes. However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix:
+#### Singleton Regexes
+Matches only simple one-codepoint (+ optional variation selector) Emoji:
 Regex                         | Description | Example Matches | Example Non-Matches
 ------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_INCLUDE_TEXT`       | `REGEX` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🤾🏽‍♀️`, `😴︎`, `▶` | `🏻`, `🇵🇵`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤠‍🤢`
-`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`, `😴︎`, `▶` | `🏻`, `🇵🇵`
-`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `😴`, `▶️`, `🛌🏽`, `🇵🇹`, `2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤠‍🤢`,  `🇵🇵`, `😴︎`, `▶` | `🏻`
-#### Extended Pictographic Regex
+`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `😴`, `▶️` | `😴︎`, `▶`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+`Unicode::Emoji::REGEX_TEXT`  | Matches only textual singleton Emoji (except for singleton components, like digits) | `😴︎`, `▶` | `😴`, `▶️`, `🏻`, `🛌🏽`, `🇵🇹`, `🇵🇵`,`2️⃣`, `🏴󠁧󠁢󠁳󠁣󠁴󠁿`, `🏴󠁧󠁢󠁡󠁧󠁢󠁿`, `🤾🏽‍♀️`, `🤾🏽‍♀`, `🏌‍♂️`, `🤠‍🤢`, `1`
+Here is a list of all Emoji that can be matched using the two regexes: [character.construction/emoji-vs-text](https://character.construction/emoji-vs-text)
+While `REGEX_BASIC` is part of the above regexes, `REGEX_TEXT` is only included in the `*_INCLUDE_TEXT` or `*_UQE` variants.
+### Comparison
+1) Fully-qualified RGI Emoji ZWJ sequence
+2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character)
+3) Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12).
+4) Non-RGI Emoji ZWJ sequence
+5) Valid Region made from a pair of Regional Indicators
+6) Any Region made from a pair of Regional Indicators
+7) RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
+8) Valid Flag Emoji Tag Sequences (any known subdivision)
+9) Any Emoji Tag Sequences (any tag sequence with any base)
+10) Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
+11) Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
+12) Non-Emoji (unqualified) keycap
+Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
+-|-|-|-|-|-|-|-|-|-|-|-|-
+REGEX                          | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE TEXT             | ✅ | ❌ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX INCLUDE MQE              | ✅ | ✅ | ❌    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE MQE UQE          | ✅ | ✅ | ✅    | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX VALID                    | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
+REGEX VALID INCLUDE TEXT       | ✅ | ✅ | ✅    | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
+REGEX WELL FORMED              | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
+REGEX WELL FORMED INCLUDE TEXT | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
+REGEX POSSIBLE                 | ✅ | ✅ | ✅    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
+REGEX BASIC                    | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX TEXT                     | ❌ | ❌ | ❌    | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
+¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
+See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.
+### Picking the Right Emoji Regex
+- Usually you just want `REGEX` (recommended Emoji set, RGI)
+- Use `REGEX_INCLUDE_MQE` or `REGEX_INCLUDE_MQE_UQE` if you want to catch Emoji sequences with missing Variation Selectors.
+- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID`
+- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED`
+- Use the `_INCLUDE_TEXT` suffix with any of the above base regexes, if you want to also match basic textual Emoji
+- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji, comparable to `REGEX_WELL_FORMED*`. It might contain false positives, however, the regex is less complex and [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check.
+### Examples
+Desc | Emoji | Escaped | `REGEX` (RGI/FQE) | `REGEX_INCLUDE_MQE` (RGI/MQE) | `REGEX_VALID` | `REGEX_WELL_FORMED` / `REGEX_POSSIBLE`
+-----|-------|---------|---------------|-----------------------|-----------------------------------|-----------------
+RGI ZWJ Sequence   | 🤾🏽‍♀️ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | ✅ | ✅ | ✅ | ✅
+RGI ZWJ Sequence MQE | 🤾🏽‍♀ | `\u{1F93E 1F3FD 200D 2640}` | ❌ | ✅ | ✅ | ✅
+Valid ZWJ Sequence, Non-RGI | 🤠‍🤢 | `\u{1F920 200D 1F922}` | ❌ | ❌  | ✅ | ✅
+Known Region       | 🇵🇹 | `\u{1F1F5 1F1F9}` | ✅ | ✅ | ✅ | ✅
+Unknown Region     | 🇵🇵 | `\u{1F1F5 1F1F5}` | ❌ | ❌  | ❌  | ✅
+RGI Tag Sequence   | 🏴󠁧󠁢󠁳󠁣󠁴󠁿 | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | ✅ | ✅ | ✅ | ✅
+Valid Tag Sequence | 🏴󠁧󠁢󠁡󠁧󠁢󠁿 | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | ❌ | ❌  | ✅ | ✅
+Well-formed Tag Sequence | 😴󠁧󠁢󠁡󠁡󠁡󠁿 | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | ❌ | ❌  | ❌  | ✅
+Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
+### Extended Pictographic Regex
 `Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.
@@ -101,18 +148,13 @@ Regex                         | Description | Example Matches | Example Non-Matc
 See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.
-#### Partial Regexes
-Matches potential Emoji parts (often, this is not what you want):
-Regex                         | Description | Example Matches | Example Non-Matches
-------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_ANY`   | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please not that this will match Emoji-parts rather than complete Emoji, for example, single digits! | `😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢` | -
+### Partial Regexes
+`Unicode::Emoji::REGEX_ANY`, same as `\p{Emoji}`. Deprecated: Will be removed or renamed in the future.
-### List
+## Usage – List
-Use `Unicode::Emoji::LIST` or the list method to get a grouped (and ordered) list of Emoji:
+Use `Unicode::Emoji::LIST` or the **list** method to get an ordered and categorized list of Emoji:
 ```ruby
 Unicode::Emoji.list.keys
@@ -125,13 +167,13 @@ Unicode::Emoji.list("Food & Drink", "food-asian")
 => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
 ```
-Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the `#list` method.
+Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
-A list of all Emoji can be found at [character.construction](https://character.construction).
+A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).
-### Properties
+## Usage – Properties Data
-Allows you to access the codepoint data form Unicode's [emoji-data.txt](https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt) file:
+Allows you to access the codepoint data for a single character from Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
 ```ruby
 require "unicode/emoji"
@@ -143,7 +185,7 @@ Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]
 - [Unicode® Technical Standard #51](https://www.unicode.org/reports/tr51/)
 - [Emoji categories](https://unicode.org/emoji/charts/emoji-ordering.html)
-- Ruby gem which displays [Emoji sequence names](https://github.com/janlelis/unicode-sequence_name) (here [as website](https://character.construction/name))
+- Ruby gem which displays [Emoji sequence names](https://github.com/janlelis/unicode-sequence_name) ([as website](https://character.construction/name))
 - Part of [unicode-x](https://github.com/janlelis/unicode-x)
 ## MIT
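The README describes `REGEX_POSSIBLE` as a simplified first check. The sketch below does not use the gem; it rebuilds a comparable pattern by hand, following the structure of `emoji_possible` in `generate_constants.rb` (element, then any number of ZWJ-joined elements). The names `POSSIBLE_ELEMENT` and `POSSIBLE` are hypothetical, and the native property names are assumed to be available in the running Ruby.

```ruby
# One "possible" element: a pair of regional indicators, or any Emoji
# character optionally followed by a modifier, by VS16 (+ optional keycap),
# or by raw tag characters closed with CANCEL TAG (U+E007F).
POSSIBLE_ELEMENT = /
  \p{Regional_Indicator}{2}
  |
  \p{Emoji}
  (?:
    \p{Emoji_Modifier}
    | \u{FE0F} \u{20E3}?
    | [\u{E0020}-\u{E007E}]+ \u{E007F}
  )?
/x

# Elements may be chained with ZWJ (U+200D)
POSSIBLE = /#{POSSIBLE_ELEMENT}(?:\u{200D}#{POSSIBLE_ELEMENT})*/

samples = {
  "RGI ZWJ sequence"     => "\u{1F93E 1F3FD 200D 2640 FE0F}",
  "MQE ZWJ sequence"     => "\u{1F93E 1F3FD 200D 2640}",
  "Unknown region"       => "\u{1F1F5 1F1F5}",
  "Single digit"         => "1",
  "Non-emoji text"       => "abc",
}

samples.each do |label, string|
  # String#[] with a regex returns the matched part, or nil
  puts "#{label}: #{string[POSSIBLE].inspect}"
end
```

The single digit is reported as a match, which illustrates the false positives the README warns about. This is why a pattern like this is suited to a quick pre-check rather than a final decision.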

data/Rakefile CHANGED

@@ -28,14 +28,18 @@ task :irb do
 end
 # # #
-# Run Specs
+# Run specs
 desc "#{gemspec.name} | Spec"
 task :spec do
-  ruby "spec/unicode_emoji_spec.rb"
+  ruby File.join("spec", "*_spec.rb")
 end
 task default: :spec
+# # #
+# Generate regex
 desc "#{gemspec.name} | Generates all regex constants and saves them to lib/unicode/emoji/{generated,generated_native} directories"
 task :generate_constants do
   load "data/generate_constants.rb", true

data/data/emoji.marshal.gz CHANGED

Binary file

data/data/generate_constants.rb CHANGED

@@ -68,10 +68,10 @@ def pack_and_join(ords)
   end
 end
-def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, picto:, picto_no_emoji:)
+def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_component:, emoji_presentation:, text_presentation:, picto:, picto_no_emoji:)
   emoji_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION) + pack(EMOJI_VARIATION_SELECTOR),
+      text_presentation + pack(EMOJI_VARIATION_SELECTOR),
       emoji_presentation + "(?!" + pack(TEXT_VARIATION_SELECTOR) + ")" + pack(EMOJI_VARIATION_SELECTOR) + "?",
     )
@@ -79,14 +79,20 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     "(?!" + emoji_component + ")" + emoji_presentation_sequence
   text_keycap_sequence = \
-    join(EMOJI_KEYCAPS.map{|keycap| pack([keycap, EMOJI_KEYCAP_SUFFIX]) })
+    pack_and_join(EMOJI_KEYCAPS) + pack(EMOJI_KEYCAP_SUFFIX)
   text_presentation_sequence = \
     join(
-      pack_and_join(TEXT_PRESENTATION)+ "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
+      text_presentation + "(?!" + join(emoji_modifier, pack(EMOJI_VARIATION_SELECTOR)) + ")" + pack(TEXT_VARIATION_SELECTOR) + "?",
       emoji_presentation + pack(TEXT_VARIATION_SELECTOR),
     )
+  text_emoji = \
+    join(
+      "(?!" + emoji_component + ")" + text_presentation_sequence,
+      text_keycap_sequence,
+    )
   emoji_modifier_sequence = \
     emoji_modifier_base + emoji_modifier
@@ -97,27 +103,13 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     pack_and_join(VALID_REGION_FLAGS)
   emoji_well_formed_flag_sequence = \
-    "(?:" +
-      pack_and_join(REGIONAL_INDICATORS) +
-      pack_and_join(REGIONAL_INDICATORS) +
-    ")"
-  emoji_valid_core_sequence = \
-    join(
-      # emoji_character,
-      emoji_keycap_sequence,
-      emoji_modifier_sequence,
-      non_component_emoji_presentation_sequence,
-      emoji_valid_flag_sequence,
-    )
+    '\p{RI}{2}'
-  emoji_well_formed_core_sequence = \
+  emoji_core_sequence = \
     join(
-      # emoji_character,
       emoji_keycap_sequence,
       emoji_modifier_sequence,
       non_component_emoji_presentation_sequence,
-      emoji_well_formed_flag_sequence,
     )
   # Sort to make sure complex sequences match first
@@ -128,7 +120,7 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     "(?:" +
       pack(EMOJI_TAG_BASE_FLAG) +
       "(?:" + VALID_SUBDIVISIONS.sort_by(&:length).reverse.map{ |sd|
-        Regexp.escape(sd.tr("\u{20}-\u{7E}", "\u{E0020}-\u{E007E}"))
+        sd.tr("\u{30}-\u{39}\u{61}-\u{7A}", "\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}")
       }.join("|") + ")" +
       pack(CANCEL_TAG) +
     ")"
@@ -139,7 +131,7 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
         non_component_emoji_presentation_sequence,
         emoji_modifier_sequence,
       ) +
-      pack_and_join(TAGS) + "+" +
+      pack_and_join(SPEC_TAGS) + "{1,30}" +
       pack(CANCEL_TAG) +
     ")"
@@ -147,6 +139,18 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
   emoji_rgi_zwj_sequence = \
     pack_and_join(RECOMMENDED_ZWJ_SEQUENCES.sort_by(&:length).reverse)
+  # FQE+MQE: Make VS16 optional after ZWJ has appeared
+  emoji_rgi_include_mqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      /#{ pack(ZWJ) }[^|]+?\K#{ pack(EMOJI_VARIATION_SELECTOR) }/,
+      pack(EMOJI_VARIATION_SELECTOR) + "?"
+    )
+  # FQE+MQE+UQE: Make all VS16 optional
+  emoji_rgi_include_mqe_uqe_zwj_sequence = emoji_rgi_zwj_sequence.gsub(
+      pack(EMOJI_VARIATION_SELECTOR),
+      pack(EMOJI_VARIATION_SELECTOR) + "?",
+    )
   emoji_valid_zwj_element = \
     join(
       emoji_modifier_sequence,
@@ -163,58 +167,126 @@ def compile(emoji_character:, emoji_modifier:, emoji_modifier_base:, emoji_compo
     join(
       emoji_rgi_zwj_sequence,
       emoji_rgi_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_sequence_include_text = \
+    join(
+      emoji_rgi_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
+    )
+  emoji_rgi_include_mqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_zwj_sequence,
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_rgi_include_mqe_uqe_sequence = \
+    join(
+      emoji_rgi_include_mqe_uqe_zwj_sequence,
+      text_emoji, # also uqe
+      emoji_rgi_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
     )
   emoji_valid_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_valid_tag_sequence,
-      emoji_valid_core_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_valid_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_valid_tag_sequence,
+      emoji_valid_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
     )
   emoji_well_formed_sequence = \
     join(
       emoji_valid_zwj_sequence,
       emoji_well_formed_tag_sequence,
-      emoji_well_formed_core_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+    )
+  emoji_well_formed_sequence_include_text = \
+    join(
+      emoji_valid_zwj_sequence,
+      emoji_well_formed_tag_sequence,
+      emoji_well_formed_flag_sequence,
+      emoji_core_sequence,
+      text_emoji,
+    )
+  emoji_possible_modification = \
+    join(
+      emoji_modifier,
+      pack([EMOJI_VARIATION_SELECTOR, EMOJI_KEYCAP_SUFFIX]) + "?",
+      "[󠀠-󠁾]+󠁿" # raw tags
+    )
+  emoji_possible_zwj_element = \
+    join(
+      emoji_well_formed_flag_sequence,
+      emoji_character + emoji_possible_modification + "?"
     )
+  emoji_possible = \
+    emoji_possible_zwj_element + "(?:" + pack(ZWJ) + emoji_possible_zwj_element + ")*"
   regexes = {}
   # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
   regexes[:REGEX] = Regexp.compile(emoji_rgi_sequence)
+  # rgi + singleton text
+  regexes[:REGEX_INCLUDE_TEXT] = Regexp.compile(emoji_rgi_sequence_include_text)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional if not at first emoji character
+  regexes[:REGEX_INCLUDE_MQE] = Regexp.compile(emoji_rgi_include_mqe_sequence)
+  # Matches basic singleton emoji and all kind of sequences, but restrict zwj and tag sequences to known sequences (rgi)
+  # Also make VS16 optional even at first emoji character
+  regexes[:REGEX_INCLUDE_MQE_UQE] = Regexp.compile(emoji_rgi_include_mqe_uqe_sequence)
   # Matches basic singleton emoji and all kind of valid sequences
   regexes[:REGEX_VALID] = Regexp.compile(emoji_valid_sequence)
+  # valid + singleton text
+  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.compile(emoji_valid_sequence_include_text)
   # Matches basic singleton emoji and all kind of sequences
   regexes[:REGEX_WELL_FORMED] = Regexp.compile(emoji_well_formed_sequence)
+  # well-formed + singleton text
+  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.compile(emoji_well_formed_sequence_include_text)
-  # Matches only basic single, non-textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_BASIC] = Regexp.compile(
-    "(?!" + emoji_component + ")" + emoji_presentation_sequence
-  )
+  # Quick test which might lead to false positives
+  # See https://www.unicode.org/reports/tr51/#EBNF_and_Regex
+  regexes[:REGEX_POSSIBLE] = Regexp.compile(emoji_possible)
-  # Matches only basic single, textual emoji
-  # Ignores "components" like modifiers or simple digits
-  regexes[:REGEX_TEXT] = Regexp.compile(
-    join(
-      "(?!" + emoji_component + ")" + text_presentation_sequence,
-      text_keycap_sequence,
-    )
-  )
+  # Matches only basic single, non-textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_BASIC] = Regexp.compile(non_component_emoji_presentation_sequence)
-  # Matches any emoji-related codepoint - Use with caution (returns partial matches)
-  regexes[:REGEX_ANY] = Regexp.compile(emoji_character)
-  # Combined REGEXes which also match for TEXTUAL emoji
-  regexes[:REGEX_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX], regexes[:REGEX_TEXT])
+  # Matches only basic single, textual emoji, ignores "components" like modifiers or simple digits
+  regexes[:REGEX_TEXT] = Regexp.compile(text_emoji)
-  regexes[:REGEX_VALID_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_VALID], regexes[:REGEX_TEXT])
-  regexes[:REGEX_WELL_FORMED_INCLUDE_TEXT] = Regexp.union(regexes[:REGEX_WELL_FORMED], regexes[:REGEX_TEXT])
+  # Same as \p{Emoji} - to be removed or renamed
+  regexes[:REGEX_ANY] = Regexp.compile(emoji_character)
   regexes[:REGEX_PICTO] = Regexp.compile(picto)
@@ -229,6 +301,7 @@ regexes = compile(
   emoji_modifier_base:  pack_and_join(EMOJI_MODIFIER_BASES),
   emoji_component:      pack_and_join(EMOJI_COMPONENT),
   emoji_presentation:   pack_and_join(EMOJI_PRESENTATION),
+  text_presentation:    pack_and_join(TEXT_PRESENTATION),
   picto:                pack_and_join(EXTENDED_PICTOGRAPHIC),
   picto_no_emoji:       pack_and_join(EXTENDED_PICTOGRAPHIC_NO_EMOJI)
 )
@@ -240,6 +313,7 @@ native_regexes = compile(
   emoji_modifier_base:  "\\p{EBase}",
   emoji_component:      "\\p{EComp}",
   emoji_presentation:   "\\p{EPres}",
+  text_presentation:    "\\p{Emoji}(?<!\\p{EPres})",
   picto:                "\\p{ExtPict}",
   picto_no_emoji:       "\\p{ExtPict}(?<!\\p{Emoji})"
 )
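The two `gsub` calls in this diff derive the MQE and UQE variants from the fully-qualified ZWJ alternation. The sketch below replays that idea on a two-entry list so the effect is visible. It uses raw characters instead of the generator's `pack` helper, and `RGI_SAMPLE` and the `anchored` helper are hypothetical names introduced for illustration only.

```ruby
VS16 = "\u{FE0F}" # emoji variation selector
ZWJ  = "\u{200D}" # zero width joiner

# Two fully-qualified RGI ZWJ sequences (hypothetical sample list)
RGI_SAMPLE = [
  "\u{1F93E 1F3FD 200D 2640 FE0F}", # woman playing handball, medium skin tone
  "\u{1F3CC FE0F 200D 2642 FE0F}",  # man golfing
].freeze

# Longest first, so complex sequences are tried before shorter ones
fqe_source = RGI_SAMPLE.sort_by(&:length).reverse.join("|")

# FQE+MQE: make VS16 optional only once a ZWJ has appeared
mqe_source = fqe_source.gsub(/#{ZWJ}[^|]+?\K#{VS16}/, VS16 + "?")

# FQE+MQE+UQE: make every VS16 optional
uqe_source = fqe_source.gsub(VS16, VS16 + "?")

# Hypothetical helper: anchor a pattern source to the whole string
def anchored(source)
  Regexp.new("\\A(?:#{source})\\z")
end

mqe_handball = "\u{1F93E 1F3FD 200D 2640}"  # VS16 missing after the ZWJ
uqe_golfer   = "\u{1F3CC 200D 2642 FE0F}"   # VS16 missing on the first character

[fqe_source, mqe_source, uqe_source].each do |source|
  regex = anchored(source)
  puts [mqe_handball.match?(regex), uqe_golfer.match?(regex)].inspect
end
```

The `\K` in the first substitution keeps everything up to the VS16 out of the replaced span, so only the selector itself gains the `?` quantifier.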

data/lib/unicode/emoji/constants.rb CHANGED

@@ -2,12 +2,13 @@
 module Unicode
   module Emoji
-    VERSION = "3.6.0"
+    VERSION = "3.8.0"
     EMOJI_VERSION = "16.0"
     CLDR_VERSION = "45"
     DATA_DIRECTORY = File.expand_path('../../../data', __dir__).freeze
     INDEX_FILENAME = (DATA_DIRECTORY + "/emoji.marshal.gz").freeze
+    # Unicode properties, see https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt
     PROPERTY_NAMES = {
       E: "Emoji",
       B: "Emoji_Modifier_Base",
@@ -17,13 +18,28 @@ module Unicode
       X: "Extended_Pictographic",
     }.freeze
+    # Variation Selector 16 (VS16), enables emoji presentation mode for preceding codepoint
     EMOJI_VARIATION_SELECTOR      = 0xFE0F
+    # Variation Selector 15 (VS15), enables text presentation mode for preceding codepoint
     TEXT_VARIATION_SELECTOR       = 0xFE0E
+    # First codepoint of tag-based subdivision flags
     EMOJI_TAG_BASE_FLAG           = 0x1F3F4
+    # Last codepoint of tag-based subdivision flags
     CANCEL_TAG                    = 0xE007F
-    TAGS                          = [*0xE0020..0xE007E].freeze
+    # Tags characters allowed in tag-based subdivision flags
+    SPEC_TAGS                     = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze
+    # Combining Enclosing Keycap character
     EMOJI_KEYCAP_SUFFIX           = 0x20E3
+    # Zero-width-joiner to enable combination of multiple Emoji in a sequence
     ZWJ                           = 0x200D
+    # Two regional indicators make up a region
     REGIONAL_INDICATORS           = [*0x1F1E6..0x1F1FF].freeze
   end
 end
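The new `SPEC_TAGS` constant, together with the `{1,30}` quantifier in `generate_constants.rb`, narrows which tag sequences count as well-formed. Here is a rough, gem-free sketch of that rule. It is restricted to the black flag base for brevity (the gem's well-formed regex accepts other bases too), and `TAG_FLAG` and `tag_flag` are hypothetical names.

```ruby
# Values copied from constants.rb in this diff
EMOJI_TAG_BASE_FLAG = 0x1F3F4
CANCEL_TAG          = 0xE007F
SPEC_TAGS           = [*0xE0030..0xE0039, *0xE0061..0xE007A].freeze

BASE   = [EMOJI_TAG_BASE_FLAG].pack("U")
CANCEL = [CANCEL_TAG].pack("U")

# Black flag + 1 to 30 allowed tag characters + CANCEL TAG
TAG_FLAG = Regexp.new("\\A" + BASE + "[" + SPEC_TAGS.pack("U*") + "]{1,30}" + CANCEL + "\\z")

# Hypothetical helper: turn an ASCII subdivision code into a tag sequence,
# using the same digit and lowercase-letter mapping as the generator's tr call
def tag_flag(subdivision)
  BASE + subdivision.tr("0-9a-z", "\u{E0030}-\u{E0039}\u{E0061}-\u{E007A}") + CANCEL
end

scotland  = tag_flag("gbsct")
uppercase = BASE + "\u{E0041}" + CANCEL # TAG LATIN CAPITAL LETTER A, not in SPEC_TAGS
too_long  = tag_flag("a" * 31)

puts scotland.match?(TAG_FLAG)
puts uppercase.match?(TAG_FLAG)
puts too_long.match?(TAG_FLAG)
```

Under these assumptions the Scotland sequence passes, while a sequence containing an uppercase tag letter or more than 30 tag characters is rejected.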