micromark-extension-cjk-friendly-util 2.1.0 → 3.0.0
- package/README.md +15 -15
- package/dist/categoryUtil.d.ts +7 -8
- package/dist/categoryUtil.js +55 -34
- package/dist/characterWithNonBmp.d.ts +5 -3
- package/dist/characterWithNonBmp.js +88 -36
- package/dist/classifyCharacter.d.ts +13 -12
- package/dist/classifyCharacter.js +60 -95
- package/dist/codeUtil.d.ts +27 -25
- package/dist/codeUtil.js +112 -92
- package/dist/index.d.ts +4 -5
- package/dist/index.js +4 -230
- package/package.json +7 -9
package/README.md
CHANGED
@@ -1,10 +1,10 @@
 # micromark-extension-cjk-friendly-util
 
-[](https://npmjs.com/package/micromark-extension-cjk-friendly-util)  [](https://npmjs.com/package/micromark-extension-cjk-friendly-util) [](https://npmjs.com/package/micromark-extension-cjk-friendly-util)
+[](https://npmjs.com/package/micromark-extension-cjk-friendly-util)  [](https://npmjs.com/package/micromark-extension-cjk-friendly-util) [](https://npmjs.com/package/micromark-extension-cjk-friendly-util) [](https://socket.dev/npm/package/micromark-extension-cjk-friendly-util) [](https://snyk.io/advisor/npm-package/micromark-extension-cjk-friendly-util)
 
 An utility library package for [micromark-extension-cjk-friendly](https://npmjs.com/package/micromark-extension-cjk-friendly), which is internally used by [remark-cjk-friendly](https://npmjs.com/package/remark-cjk-friendly), and its related packages.
 
-## Problem / <span lang="ja">問題</span> / <span lang="zh-Hans-CN">问题</span> / <span lang="ko"
+## Problem / <span lang="ja">問題</span> / <span lang="zh-Hans-CN">问题</span> / <span lang="ko">문제</span>
 
 CommonMark has a problem that the following emphasis marks `**` are not recognized as emphasis marks in Japanese, Chinese, and Korean.
 
@@ -12,7 +12,7 @@ CommonMark has a problem that the following emphasis marks `**` are not recogniz
 
 <span lang="zh-Hans-CN">CommonMark存在以下问题:在中文、日语和韩语文本中,强调标记`**`不会被识别为强调标记。</span>
 
-<span lang="ko">CommonMark는
+<span lang="ko">CommonMark는 한국어, 일본어, 중국어에서 다음과 같은 강조 표시 `**`가 강조 표시로 인식되지 않는 문제가 있습니다.</span>
 
 ```md
 **このアスタリスクは強調記号として認識されず、そのまま表示されます。**この文のせいで。
@@ -40,15 +40,15 @@ Of course, not only the end side but also the start side has the same issue.
 
 CommonMark issue: https://github.com/commonmark/commonmark-spec/issues/650
 
-## Runtime Requirements / <span lang="ja">実行環境の要件</span> / <span lang="zh-Hans-CN">运行环境要求</span> / <span lang="ko"
+## Runtime Requirements / <span lang="ja">実行環境の要件</span> / <span lang="zh-Hans-CN">运行环境要求</span> / <span lang="ko">런타임 요구 사항</span>
 
-This package is ESM-only. It requires Node.js
+This package is ESM-only. It requires Node.js 18 or later. (I have only tested it on 20 and later. There is no factor that would prevent it from working on 18, but I do not guarantee its operation on 18.)
 
-<span lang="ja">本パッケージはESM専用です。Node.js
+<span lang="ja">本パッケージはESM専用です。Node.js 18以上が必要です。(動作検証は20以降でのみ行っています。18での動作を妨げる要因はありませんが、動作の保証はありません)</span>
 
-<span lang="zh-Hans-CN">此包仅支持ESM。需要Node.js
+<span lang="zh-Hans-CN">此包仅支持ESM。需要Node.js 18或更高版本。(我只测试了20及以后的版本。没有因素会阻止它在18上工作,但我不保证在18上的操作。)</span>
 
-<span lang="ko"
+<span lang="ko">본 패키지는 ESM 전용입니다. Node.js 18 이상이 필요합니다. (동작 검증은 20 이후 버전에서만 수행했습니다. 18에서 동작을 방해하는 요인은 없으나, 동작을 보장하지는 않습니다)</span>
 
 ## Installation / <span lang="ja">インストール</span> / <span lang="zh-Hans-CN">安装</span> / <span lang="ko">설치</span>
 
@@ -70,7 +70,7 @@ If you use another package manager, please replace `npm install` with the comman
 
 <span lang="zh-Hans-CN">如果使用其他包管理器,请将 `npm install` 替换为当时包管理器的命令(例如:`pnpm add`、`yarn add`)。</span>
 
-<span lang="ko"
+<span lang="ko">npm이 아닌 다른 패키지 매니저를 사용하는 경우 `npm install`을 해당 패키지 매니저의 명령어(예: `pnpm add`, `yarn add`)로 바꿔 주세요.</span>
 
 ## Usage / <span lang="ja">使い方</span> / <span lang="zh-Hans-CN">用法</span> / <span lang="ko">사용법</span>
 
@@ -89,17 +89,17 @@ This package provides a function and a namespace based on the original micromark
 
 Also, this package provides some utility functions to check whether a character belongs to the category defined in the specification (e.g. CJK character), or to help you fetch the Unicode Code Point of a character around the emphasis mark.
 
-## Specification / <span lang="ja">規格書</span> / <span lang="zh-Hans-CN">规范</span> / <span lang="ko"
+## Specification / <span lang="ja">規格書</span> / <span lang="zh-Hans-CN">规范</span> / <span lang="ko">설명서</span>
 
 https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md (English)
 
 ## Related packages / <span lang="ja">関連パッケージ</span> / <span lang="zh-Hans-CN">相关包</span> / <span lang="ko">관련 패키지</span>
 
-- [micromark-extension-cjk-friendly](https://npmjs.com/package/micromark-extension-cjk-friendly) [](https://npmjs.com/package/micromark-extension-cjk-friendly)  [](https://npmjs.com/package/micromark-extension-cjk-friendly) [](https://npmjs.com/package/micromark-extension-cjk-friendly)
-- [remark-cjk-friendly](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)  [](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)
-- [markdown-it-cjk-friendly](https://npmjs.com/package/markdown-it-cjk-friendly) [](https://npmjs.com/package/markdown-it-cjk-friendly)  [](https://npmjs.com/package/markdown-it-cjk-friendly) [](https://npmjs.com/package/markdown-it-cjk-friendly)
-- [remark-cjk-friendly](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)  [](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)
-- [micromark-extension-cjk-friendly-gfm-strikethrough](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough)  [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough)
+- [micromark-extension-cjk-friendly](https://npmjs.com/package/micromark-extension-cjk-friendly) [](https://npmjs.com/package/micromark-extension-cjk-friendly)  [](https://npmjs.com/package/micromark-extension-cjk-friendly) [](https://npmjs.com/package/micromark-extension-cjk-friendly) [](https://socket.dev/npm/package/micromark-extension-cjk-friendly) [](https://snyk.io/advisor/npm-package/micromark-extension-cjk-friendly)
+- [remark-cjk-friendly](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)  [](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly) [](https://socket.dev/npm/package/remark-cjk-friendly) [](https://snyk.io/advisor/npm-package/remark-cjk-friendly)
+- [markdown-it-cjk-friendly](https://npmjs.com/package/markdown-it-cjk-friendly) [](https://npmjs.com/package/markdown-it-cjk-friendly)  [](https://npmjs.com/package/markdown-it-cjk-friendly) [](https://npmjs.com/package/markdown-it-cjk-friendly) [](https://socket.dev/npm/package/markdown-it-cjk-friendly) [](https://snyk.io/advisor/npm-package/markdown-it-cjk-friendly)
+- [remark-cjk-friendly](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly)  [](https://npmjs.com/package/remark-cjk-friendly) [](https://npmjs.com/package/remark-cjk-friendly) [](https://socket.dev/npm/package/remark-cjk-friendly) [](https://snyk.io/advisor/npm-package/remark-cjk-friendly)
+- [micromark-extension-cjk-friendly-gfm-strikethrough](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough)  [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://npmjs.com/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://socket.dev/npm/package/micromark-extension-cjk-friendly-gfm-strikethrough) [](https://snyk.io/advisor/npm-package/micromark-extension-cjk-friendly-gfm-strikethrough)
 
 ## Contributing / <span lang="ja">貢献</span> / <span lang="zh-Hans-CN">贡献</span> / <span lang="ko">기여</span>
 
package/dist/categoryUtil.d.ts
CHANGED
@@ -1,13 +1,12 @@
-import { classifyCharacter } from
-import 'micromark-util-symbol';
-import 'micromark-util-types';
+import { classifyCharacter } from "./classifyCharacter.js";
 
+//#region src/categoryUtil.d.ts
 type Category = ReturnType<typeof classifyCharacter>;
 /**
- * `true` if the code point represents
+ * `true` if the code point represents a [Unicode whitespace character](https://spec.commonmark.org/0.31.2/#unicode-whitespace-character).
  *
  * @param category the return value of `classifyCharacter`.
- * @returns `true` if the code point represents
+ * @returns `true` if the code point represents a Unicode whitespace character
  */
 declare function isUnicodeWhitespace(category: Category): boolean;
 /**
@@ -46,11 +45,11 @@ declare function isCjkOrIvs(category: Category): boolean;
  */
 declare function isNonEmojiGeneralUseVS(category: Category): boolean;
 /**
- * `true` if the code point represents
+ * `true` if the code point represents a [Unicode whitespace character](https://spec.commonmark.org/0.31.2/#unicode-whitespace-character) or a [Unicode punctuation character](https://spec.commonmark.org/0.31.2/#unicode-punctuation-character).
  *
  * @param category the return value of `classifyCharacter`.
  * @returns `true` if the code point represents a space or punctuation
  */
 declare function isSpaceOrPunctuation(category: Category): boolean;
-
-export { isCjk, isCjkOrIvs, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace };
+//#endregion
+export { isCjk, isCjkOrIvs, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace };
package/dist/categoryUtil.js
CHANGED
@@ -1,49 +1,70 @@
-
-import { constants
+import { constantsEx } from "./classifyCharacter.js";
+import { constants } from "micromark-util-symbol";
 
-
-
-
-
-
-
-
-constantsEx2.cjkPunctuation = 4098;
-constantsEx2.ivs = 8192;
-constantsEx2.cjkOrIvs = 12288;
-constantsEx2.nonEmojiGeneralUseVS = 16384;
-constantsEx2.variationSelector = 24576;
-constantsEx2.ivsToCjkRightShift = 1;
-})(constantsEx || (constantsEx = {}));
-
-// src/categoryUtil.ts
+//#region src/categoryUtil.ts
+/**
+ * `true` if the code point represents a [Unicode whitespace character](https://spec.commonmark.org/0.31.2/#unicode-whitespace-character).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents a Unicode whitespace character
+ */
 function isUnicodeWhitespace(category) {
-
+  return Boolean(category & constants.characterGroupWhitespace);
 }
+/**
+ * `true` if the code point represents a [non-CJK punctuation character](https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md#non-cjk-punctuation-character).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents a non-CJK punctuation character
+ */
 function isNonCjkPunctuation(category) {
-
+  return (category & constantsEx.cjkPunctuation) === constants.characterGroupPunctuation;
 }
+/**
+ * `true` if the code point represents a [CJK character](https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md#cjk-character).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents a CJK character
+ */
 function isCjk(category) {
-
+  return Boolean(category & constantsEx.cjk);
 }
+/**
+ * `true` if the code point represents an [Ideographic Variation Selector](https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md#ideographi-variation-selector).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents an IVS
+ */
 function isIvs(category) {
-
+  return category === constantsEx.ivs;
 }
+/**
+ * `true` if {@link isCjk} or {@link isIvs}.
+ *
+ * @param category the return value of {@link classifyCharacter}.
+ * @returns `true` if the code point represents a CJK or IVS
+ */
 function isCjkOrIvs(category) {
-
+  return Boolean(category & constantsEx.cjkOrIvs);
 }
+/**
+ * `true` if the code point represents a [Non-emoji General-use Variation Selector](https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md#non-emoji-general-use-variation-selector).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents an Non-emoji General-use Variation Selector
+ */
 function isNonEmojiGeneralUseVS(category) {
-
+  return category === constantsEx.nonEmojiGeneralUseVS;
 }
+/**
+ * `true` if the code point represents a [Unicode whitespace character](https://spec.commonmark.org/0.31.2/#unicode-whitespace-character) or a [Unicode punctuation character](https://spec.commonmark.org/0.31.2/#unicode-punctuation-character).
+ *
+ * @param category the return value of `classifyCharacter`.
+ * @returns `true` if the code point represents a space or punctuation
+ */
 function isSpaceOrPunctuation(category) {
-
+  return Boolean(category & constantsEx.spaceOrPunctuation);
 }
-
-
-
-isIvs,
-isNonCjkPunctuation,
-isNonEmojiGeneralUseVS,
-isSpaceOrPunctuation,
-isUnicodeWhitespace
-};
+
+//#endregion
+export { isCjk, isCjkOrIvs, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace };
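The category predicates in this file are all bit tests against the value returned by `classifyCharacter`. The following is a minimal standalone sketch of four of them, not the package's own module: the `constantsEx` literals are copied from this diff, while the value `2` for micromark's `characterGroupPunctuation` is an assumption (it is consistent with `spaceOrPunctuation` being `3`).

```javascript
// Assumed value of constants.characterGroupPunctuation in micromark-util-symbol.
const characterGroupPunctuation = 2;

// Literals copied from the constantsEx namespace shown in this diff.
const cjk = 4096;
const cjkPunctuation = 4098;
const ivs = 8192;
const cjkOrIvs = 12288;
const spaceOrPunctuation = 3;

// Same expressions as the added lines in categoryUtil.js.
const isCjk = (category) => Boolean(category & cjk);
const isNonCjkPunctuation = (category) =>
  (category & cjkPunctuation) === characterGroupPunctuation;
const isCjkOrIvs = (category) => Boolean(category & cjkOrIvs);
const isSpaceOrPunctuation = (category) =>
  Boolean(category & spaceOrPunctuation);

// A CJK punctuation category carries both the CJK bit and the punctuation bit.
console.log(isCjk(cjkPunctuation)); // true
console.log(isNonCjkPunctuation(cjkPunctuation)); // false
console.log(isNonCjkPunctuation(characterGroupPunctuation)); // true
console.log(isCjkOrIvs(ivs)); // true
```

Masking with `cjkPunctuation` before comparing is what lets `isNonCjkPunctuation` reject a punctuation character that also has the CJK bit set.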
|
package/dist/characterWithNonBmp.d.ts
CHANGED
|
@@ -1,4 +1,6 @@
-import { Code } from
+import { Code } from "micromark-util-types";
+
+//#region src/characterWithNonBmp.d.ts
 
 /**
  * Check if `uc` is CJK or IVS
@@ -53,5 +55,5 @@ declare const unicodePunctuation: (code: Code) => boolean;
  * Whether it matches.
  */
 declare const unicodeWhitespace: (code: Code) => boolean;
-
-export { cjkOrIvs, isCjkAmbiguousPunctuation, nonEmojiGeneralUseVS, unicodePunctuation, unicodeWhitespace };
+//#endregion
+export { cjkOrIvs, isCjkAmbiguousPunctuation, nonEmojiGeneralUseVS, unicodePunctuation, unicodeWhitespace };
|
package/dist/characterWithNonBmp.js
CHANGED
|
@@ -1,47 +1,99 @@
-// src/characterWithNonBmp.ts
 import { eastAsianWidthType } from "get-east-asian-width";
+
+//#region src/characterWithNonBmp.ts
 function isEmoji(uc) {
-
+  return /^\p{Emoji_Presentation}/u.test(String.fromCodePoint(uc));
 }
+/**
+ * Check if `uc` is CJK or IVS
+ *
+ * @param uc code point
+ * @returns `true` if `uc` is CJK, `null` if IVS, or `false` if neither
+ */
 function cjkOrIvs(uc) {
-
-
-
-
-
-
-
-
-
-case "wide":
-return !isEmoji(uc);
-case "narrow":
-return false;
-case "ambiguous":
-return 917760 <= uc && uc <= 917999 ? null : false;
-case "neutral":
-return /^\p{sc=Hangul}/u.test(String.fromCodePoint(uc));
-}
+  if (!uc || uc < 4352) return false;
+  switch (eastAsianWidthType(uc)) {
+    case "fullwidth":
+    case "halfwidth": return true;
+    case "wide": return !isEmoji(uc);
+    case "narrow": return false;
+    case "ambiguous": return 917760 <= uc && uc <= 917999 ? null : false;
+    case "neutral": return /^\p{sc=Hangul}/u.test(String.fromCodePoint(uc));
+  }
 }
 function isCjkAmbiguousPunctuation(main, vs) {
-
-
+  if (vs !== 65025 || !main || main < 8216) return false;
+  return main === 8216 || main === 8217 || main === 8220 || main === 8221;
 }
+/**
+ * Check whether the character code represents Non-emoji General-use Variation Selector (U+FE00-U+FE0E).
+ */
 function nonEmojiGeneralUseVS(code) {
-
+  return code !== null && code >= 65024 && code <= 65038;
 }
-
-
+/**
+ * Check whether the character code represents Unicode punctuation.
+ *
+ * A **Unicode punctuation** is a character in the Unicode `Pc` (Punctuation,
+ * Connector), `Pd` (Punctuation, Dash), `Pe` (Punctuation, Close), `Pf`
+ * (Punctuation, Final quote), `Pi` (Punctuation, Initial quote), `Po`
+ * (Punctuation, Other), or `Ps` (Punctuation, Open) categories, or an ASCII
+ * punctuation (see `asciiPunctuation`).
+ *
+ * See:
+ * **\[UNICODE]**:
+ * [The Unicode Standard](https://www.unicode.org/versions/).
+ * Unicode Consortium.
+ *
+ * @param code
+ * Code.
+ * @returns
+ * Whether it matches.
+ */
+const unicodePunctuation = regexCheck(/\p{P}|\p{S}/u);
+/**
+ * Check whether the character code represents Unicode whitespace.
+ *
+ * Note that this does handle micromark specific markdown whitespace characters.
+ * See `markdownLineEndingOrSpace` to check that.
+ *
+ * A **Unicode whitespace** is a character in the Unicode `Zs` (Separator,
+ * Space) category, or U+0009 CHARACTER TABULATION (HT), U+000A LINE FEED (LF),
+ * U+000C (FF), or U+000D CARRIAGE RETURN (CR) (**\[UNICODE]**).
+ *
+ * See:
+ * **\[UNICODE]**:
+ * [The Unicode Standard](https://www.unicode.org/versions/).
+ * Unicode Consortium.
+ *
+ * @param code
+ * Code.
+ * @returns
+ * Whether it matches.
+ */
+const unicodeWhitespace = regexCheck(/\s/);
+/**
+ * Create a code check from a regex.
+ *
+ * @param regex
+ * Expression.
+ * @returns
+ * Check.
+ */
 function regexCheck(regex) {
-
-
-
-
+  return check;
+  /**
+   * Check whether a code matches the bound regex.
+   *
+   * @param code
+   * Character code.
+   * @returns
+   * Whether the character code matches the bound regex.
+   */
+  function check(code) {
+    return code !== null && code > -1 && regex.test(String.fromCodePoint(code));
+  }
 }
-
-
-
-nonEmojiGeneralUseVS,
-unicodePunctuation,
-unicodeWhitespace
-};
+
+//#endregion
+export { cjkOrIvs, isCjkAmbiguousPunctuation, nonEmojiGeneralUseVS, unicodePunctuation, unicodeWhitespace };
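The `regexCheck` helper and the range test in `nonEmojiGeneralUseVS` can be tried in isolation. The sketch below mirrors the added lines above with the decimal literals rewritten in hex (65024 is U+FE00, 65038 is U+FE0E); the sample code points are chosen here for illustration. Going by the regex rather than the JSDoc text, `unicodePunctuation` also matches the Unicode symbol categories (`\p{S}`), so `$` counts.

```javascript
// Build a code-point predicate from a regex, rejecting null (EOF) and the
// negative virtual codes micromark uses for things like virtual spaces.
function regexCheck(regex) {
  return function check(code) {
    return code !== null && code > -1 && regex.test(String.fromCodePoint(code));
  };
}

const unicodePunctuation = regexCheck(/\p{P}|\p{S}/u);
const unicodeWhitespace = regexCheck(/\s/);

// U+FE00..U+FE0E: every general-use variation selector except VS16 (U+FE0F).
function nonEmojiGeneralUseVS(code) {
  return code !== null && code >= 0xfe00 && code <= 0xfe0e;
}

console.log(unicodePunctuation(0x3002)); // true  (IDEOGRAPHIC FULL STOP)
console.log(unicodePunctuation(0x24)); // true  (DOLLAR SIGN, a symbol)
console.log(unicodePunctuation(0x41)); // false (LATIN CAPITAL LETTER A)
console.log(unicodeWhitespace(0x3000)); // true  (IDEOGRAPHIC SPACE)
console.log(nonEmojiGeneralUseVS(0xfe0f)); // false (VS16 is excluded)
```

`String.fromCodePoint` is used instead of `String.fromCharCode` so that code points above U+FFFF are turned into a proper surrogate pair before the regex runs.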
|
package/dist/classifyCharacter.d.ts
CHANGED
|
@@ -1,15 +1,16 @@
-import { constants } from
-import { Code } from
+import { constants } from "micromark-util-symbol";
+import { Code } from "micromark-util-types";
 
+//#region src/classifyCharacter.d.ts
 declare namespace constantsEx {
-
-
-
-
-
-
-
-
+  const spaceOrPunctuation: 3;
+  const cjk: 4096;
+  const cjkPunctuation: 4098;
+  const ivs: 8192;
+  const cjkOrIvs: 12288;
+  const nonEmojiGeneralUseVS: 16384;
+  const variationSelector: 24576;
+  const ivsToCjkRightShift: 1;
 }
 /**
  * Classify whether a code represents whitespace, punctuation, or something
@@ -38,5 +39,5 @@ declare function classifyCharacter(code: Code): typeof constants.characterGroupW
  * Group of the main code point of the preceding character. Use `isCjkOrIvs` to check whether it is CJK
  */
 declare function classifyPrecedingCharacter(before: ReturnType<typeof classifyCharacter>, get2Previous: () => Code, previous: Code): ReturnType<typeof classifyCharacter>;
-
-export { classifyCharacter, classifyPrecedingCharacter, constantsEx };
+//#endregion
+export { classifyCharacter, classifyPrecedingCharacter, constantsEx };
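The numeric literals in `constantsEx` are easier to follow as combinations of single-bit flags. The sketch below recomputes the composite values from their parts. These relationships are inferred from the numbers in the declarations, not stated by the package, and the punctuation group value `2` is an assumption about `micromark-util-symbol`.

```javascript
// Assumption: constants.characterGroupPunctuation in micromark-util-symbol is 2.
const characterGroupPunctuation = 2;

// Literal values copied from the constantsEx declarations in this diff.
const constantsEx = {
  spaceOrPunctuation: 3,
  cjk: 4096, // 1 << 12
  cjkPunctuation: 4098,
  ivs: 8192, // 1 << 13
  cjkOrIvs: 12288,
  nonEmojiGeneralUseVS: 16384, // 1 << 14
  variationSelector: 24576,
  ivsToCjkRightShift: 1,
};

// Recompute the composite values from the single-bit flags.
const derived = {
  cjkPunctuation: constantsEx.cjk | characterGroupPunctuation,
  cjkOrIvs: constantsEx.cjk | constantsEx.ivs,
  variationSelector: constantsEx.ivs | constantsEx.nonEmojiGeneralUseVS,
  // Shifting the IVS bit right by ivsToCjkRightShift lands on the CJK bit.
  ivsShifted: constantsEx.ivs >> constantsEx.ivsToCjkRightShift,
};

console.log(derived);
```

Read this way, `cjkOrIvs` and `variationSelector` act as masks, which is how `isCjkOrIvs` uses `cjkOrIvs` in `categoryUtil.js`.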
|
package/dist/classifyCharacter.js
CHANGED
|
@@ -1,104 +1,69 @@
-
+import { cjkOrIvs, isCjkAmbiguousPunctuation, nonEmojiGeneralUseVS, unicodePunctuation, unicodeWhitespace } from "./characterWithNonBmp.js";
+import { isNonEmojiGeneralUseVS, isUnicodeWhitespace } from "./categoryUtil.js";
+import { codes, constants } from "micromark-util-symbol";
 import { markdownLineEndingOrSpace } from "micromark-util-character";
-import { codes, constants as constants2 } from "micromark-util-symbol";
 
-
-
-function
-
-
-
-
-
-
-
-
-function isEmoji(uc) {
-  return /^\p{Emoji_Presentation}/u.test(String.fromCodePoint(uc));
-}
-function cjkOrIvs(uc) {
-  if (!uc || uc < 4352) {
-    return false;
-  }
-  const eaw = eastAsianWidthType(uc);
-  switch (eaw) {
-    case "fullwidth":
-    case "halfwidth":
-      return true;
-    // never be emoji
-    case "wide":
-      return !isEmoji(uc);
-    case "narrow":
-      return false;
-    case "ambiguous":
-      return 917760 <= uc && uc <= 917999 ? null : false;
-    case "neutral":
-      return /^\p{sc=Hangul}/u.test(String.fromCodePoint(uc));
-  }
-}
-function isCjkAmbiguousPunctuation(main, vs) {
-  if (vs !== 65025 || !main || main < 8216) return false;
-  return main === 8216 || main === 8217 || main === 8220 || main === 8221;
-}
-function nonEmojiGeneralUseVS(code) {
-  return code !== null && code >= 65024 && code <= 65038;
-}
-var unicodePunctuation = regexCheck(/\p{P}|\p{S}/u);
-var unicodeWhitespace = regexCheck(/\s/);
-function regexCheck(regex) {
-  return check;
-  function check(code) {
-    return code !== null && code > -1 && regex.test(String.fromCodePoint(code));
-  }
-}
-
-// src/classifyCharacter.ts
-var constantsEx;
-((constantsEx2) => {
-  constantsEx2.spaceOrPunctuation = 3;
-  constantsEx2.cjk = 4096;
-  constantsEx2.cjkPunctuation = 4098;
-  constantsEx2.ivs = 8192;
-  constantsEx2.cjkOrIvs = 12288;
-  constantsEx2.nonEmojiGeneralUseVS = 16384;
-  constantsEx2.variationSelector = 24576;
-  constantsEx2.ivsToCjkRightShift = 1;
+//#region src/classifyCharacter.ts
+let constantsEx;
+(function(_constantsEx) {
+  _constantsEx.spaceOrPunctuation = 3;
+  _constantsEx.cjk = 4096;
+  _constantsEx.cjkPunctuation = 4098;
+  _constantsEx.ivs = 8192;
+  _constantsEx.cjkOrIvs = 12288;
+  _constantsEx.nonEmojiGeneralUseVS = 16384;
+  _constantsEx.variationSelector = 24576;
+  _constantsEx.ivsToCjkRightShift = 1;
 })(constantsEx || (constantsEx = {}));
+/**
+ * Classify whether a code represents whitespace, punctuation, or something
+ * else.
+ *
+ * Used for attention (emphasis, strong), whose sequences can open or close
+ * based on the class of surrounding characters.
+ *
+ * > 👉 **Note**: eof (`null`) is seen as whitespace.
+ *
+ * @param code
+ * Code.
+ * @returns
+ * Group.
+ */
 function classifyCharacter(code) {
-
-
-
-
-
-
-
-
-
-
-
-
-
-      break;
-  }
-  }
-  if (unicodePunctuation(code)) {
-    value |= constants2.characterGroupPunctuation;
-  }
-  return value;
+  if (code === codes.eof || markdownLineEndingOrSpace(code) || unicodeWhitespace(code)) return constants.characterGroupWhitespace;
+  let value = 0;
+  if (code >= 4352) {
+    if (nonEmojiGeneralUseVS(code)) return constantsEx.nonEmojiGeneralUseVS;
+    switch (cjkOrIvs(code)) {
+      case null: return constantsEx.ivs;
+      case true:
+        value |= constantsEx.cjk;
+        break;
+    }
+  }
+  if (unicodePunctuation(code)) value |= constants.characterGroupPunctuation;
+  return value;
 }
+/**}
+ * Classify whether a code represents whitespace, punctuation, or something else.
+ *
+ * Recognizes general-use variation selectors. Use this instead of {@linkcode classifyCharacter} for previous character.
+ *
+ * @param before result of {@linkcode classifyCharacter} of the preceding character.
+ * @param get2Previous a function that returns the code point of the character before the preceding character. Use lambda or {@linkcode Function.prototype.bind}.
+ * @param previous code point of the preceding character
+ * @returns
+ * Group of the main code point of the preceding character. Use `isCjkOrIvs` to check whether it is CJK
+ */
 function classifyPrecedingCharacter(before, get2Previous, previous) {
-
-
-
-
-  const twoBefore = classifyCharacter(twoPrevious);
-  return !twoPrevious || isUnicodeWhitespace(twoBefore) ? before : isCjkAmbiguousPunctuation(twoPrevious, previous) ? constantsEx.cjkPunctuation : stripIvs(twoBefore);
+  if (!isNonEmojiGeneralUseVS(before)) return before;
+  const twoPrevious = get2Previous();
+  const twoBefore = classifyCharacter(twoPrevious);
+  return !twoPrevious || isUnicodeWhitespace(twoBefore) ? before : isCjkAmbiguousPunctuation(twoPrevious, previous) ? constantsEx.cjkPunctuation : stripIvs(twoBefore);
 }
 function stripIvs(twoBefore) {
-
+  return twoBefore & ~constantsEx.ivs;
 }
-
-
-
-constantsEx
-};
+
+//#endregion
+export { classifyCharacter, classifyPrecedingCharacter, constantsEx };
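Two small pieces of `classifyPrecedingCharacter` are easier to follow with the decimal literals written as code points. The sketch below restates `stripIvs` and `isCjkAmbiguousPunctuation` in hex: 65025 is U+FE01 (VARIATION SELECTOR-2) and 8216, 8217, 8220, 8221 are the curly quotation marks U+2018, U+2019, U+201C and U+201D. It is a standalone illustration under those readings, not the package's module.

```javascript
// IVS flag value, copied from constantsEx in this diff.
const ivs = 8192;

// Clear the IVS bit from a category and keep every other bit.
const stripIvs = (category) => category & ~ivs;

// A curly quote followed by U+FE01 is treated as CJK punctuation by
// classifyPrecedingCharacter (it returns constantsEx.cjkPunctuation for it).
function isCjkAmbiguousPunctuation(main, vs) {
  if (vs !== 0xfe01 || !main || main < 0x2018) return false;
  return (
    main === 0x2018 || // LEFT SINGLE QUOTATION MARK
    main === 0x2019 || // RIGHT SINGLE QUOTATION MARK
    main === 0x201c || // LEFT DOUBLE QUOTATION MARK
    main === 0x201d // RIGHT DOUBLE QUOTATION MARK
  );
}

console.log(stripIvs(12288)); // 4096: cjkOrIvs with the IVS bit removed is cjk
console.log(isCjkAmbiguousPunctuation(0x201c, 0xfe01)); // true
console.log(isCjkAmbiguousPunctuation(0x201c, 0xfe00)); // false: wrong selector
```

Because `classifyPrecedingCharacter` returns `before` unchanged unless it is a non-emoji general-use variation selector, the `get2Previous` callback is only invoked in that rarer case.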
package/dist/codeUtil.d.ts
CHANGED
|
@@ -1,4 +1,6 @@
|
|
|
1
|
-
import { Code, Point, TokenizeContext } from
|
|
1
|
+
import { Code, Point, TokenizeContext } from "micromark-util-types";
|
|
2
|
+
|
|
3
|
+
//#region src/codeUtil.d.ts
|
|
2
4
|
|
|
3
5
|
/**
|
|
4
6
|
* Check if the given code is a [High-Surrogate Code Unit](https://www.unicode.org/glossary/#high_surrogate_code_unit).
|
|
@@ -42,28 +44,28 @@ declare function tryGetCodeTwoBefore(previousCode: Exclude<Code, null>, nowPoint
|
|
|
42
44
|
* @see {@link tryGetCodeTwoBefore}
|
|
43
45
|
*/
|
|
44
46
|
declare class TwoPreviousCode {
|
|
45
|
-
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
49
|
-
|
|
50
|
-
|
|
51
|
-
|
|
52
|
-
|
|
53
|
-
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
47
|
+
readonly previousCode: Exclude<Code, null>;
|
|
48
|
+
readonly nowPoint: Point;
|
|
49
|
+
readonly sliceSerialize: TokenizeContext["sliceSerialize"];
|
|
50
|
+
private cachedValue;
|
|
51
|
+
/**
|
|
52
|
+
* @see {@link tryGetCodeTwoBefore}
|
|
53
|
+
*
|
|
54
|
+
* @param previousCode a previous code point. Should be greater than 65,535 if it represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character).
|
|
55
|
+
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
56
|
+
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
57
|
+
*/
|
|
58
|
+
constructor(previousCode: Exclude<Code, null>, nowPoint: Point, sliceSerialize: TokenizeContext["sliceSerialize"]);
|
|
59
|
+
/**
|
|
60
|
+
* Returns the return value of {@link tryGetCodeTwoBefore}.
|
|
61
|
+
*
|
|
62
|
+
* If the value has not been computed yet, it will be computed and cached.
|
|
63
|
+
*
|
|
64
|
+
* @see {@link tryGetCodeTwoBefore}
|
|
65
|
+
*
|
|
66
|
+
* @returns a value greater than 65,535 if the code point two positions before represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), a value less than 65,536 for a [BMP Character](https://www.unicode.org/glossary/#bmp_character), or `null` if not found
|
|
67
|
+
*/
|
|
68
|
+
value(): Code;
|
|
67
69
|
}
|
|
68
70
|
/**
|
|
69
71
|
* If `code` is a [High-Surrogate Code Unit](https://www.unicode.org/glossary/#high_surrogate_code_unit), try to get a genuine next [Unicode Scalar Value](https://www.unicode.org/glossary/#unicode_scalar_value) corresponding to the High-Surrogate Code Unit.
|
|
@@ -73,5 +75,5 @@ declare class TwoPreviousCode {
|
|
|
73
75
|
* @returns a value greater than 65,535 if the next code point represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), or `code` otherwise
|
|
74
76
|
*/
|
|
75
77
|
declare function tryGetGenuineNextCode(code: Exclude<Code, null>, nowPoint: Point, sliceSerialize: TokenizeContext["sliceSerialize"]): Exclude<Code, null>;
|
|
76
|
-
|
|
77
|
-
export { TwoPreviousCode, isCodeHighSurrogate, isCodeLowSurrogate, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode };
|
|
78
|
+
//#endregion
|
|
79
|
+
export { TwoPreviousCode, isCodeHighSurrogate, isCodeLowSurrogate, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode };
|
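Reviewer note: the declaration of `tryGetGenuineNextCode` above can be exercised without a real tokenizer. The following is a minimal sketch, not the package's test code. The function body is copied from the 3.0.0 `dist/codeUtil.js` in this diff; `fakeSliceSerialize` and `pointAt` are hypothetical stand-ins for `TokenizeContext["sliceSerialize"]` and `this.now()`, and they assume the whole input is one string chunk indexed by `_bufferIndex`.

```javascript
// Body copied from dist/codeUtil.js (3.0.0) as shown in this diff.
function tryGetGenuineNextCode(code, nowPoint, sliceSerialize) {
  const nextCandidate = sliceSerialize({
    start: nowPoint,
    end: { ...nowPoint, _bufferIndex: nowPoint._bufferIndex + 2 }
  }).codePointAt(0);
  return nextCandidate && nextCandidate >= 65536 ? nextCandidate : code;
}

// Hypothetical stand-in for TokenizeContext["sliceSerialize"]:
// treats the whole input as a single string chunk indexed by _bufferIndex.
function fakeSliceSerialize(text) {
  return ({ start, end }) => text.slice(start._bufferIndex, end._bufferIndex);
}

// Hypothetical helper that builds a Point-like object for the stub above.
function pointAt(bufferIndex) {
  return {
    line: 1,
    column: bufferIndex + 1,
    offset: bufferIndex,
    _index: 0,
    _bufferIndex: bufferIndex
  };
}

const text = "a\u{20BB7}b"; // U+20BB7 occupies two UTF-16 code units
const slice = fakeSliceSerialize(text);

// The tentative next code is a High-Surrogate Code Unit (0xD842).
const resolved = tryGetGenuineNextCode(text.charCodeAt(1), pointAt(1), slice);
// The tentative next code is a plain BMP character ("a").
const unchanged = tryGetGenuineNextCode(text.charCodeAt(0), pointAt(0), slice);

console.log(resolved.toString(16), unchanged.toString(16)); // 20bb7 61
```

In the first call, the two code units starting at the current position form a surrogate pair, so the full code point is returned. In the second call, the candidate is below 65,536, so `code` is returned unchanged.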
package/dist/codeUtil.js
CHANGED
|
@@ -1,104 +1,124 @@
|
|
|
1
|
-
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
1
|
+
//#region src/codeUtil.ts
|
|
2
|
+
/**
|
|
3
|
+
* Check if the given code is a [High-Surrogate Code Unit](https://www.unicode.org/glossary/#high_surrogate_code_unit).
|
|
4
|
+
*
|
|
5
|
+
* A High-Surrogate Code Unit is the _first_ half of a [Surrogate Pair](https://www.unicode.org/glossary/#surrogate_pair).
|
|
6
|
+
*
|
|
7
|
+
* @param code Code.
|
|
8
|
+
* @returns `true` if the code is a High-Surrogate Code Unit, `false` otherwise.
|
|
9
|
+
*/
|
|
6
10
|
function isCodeHighSurrogate(code) {
|
|
7
|
-
|
|
11
|
+
return Boolean(code && code >= 55296 && code <= 56319);
|
|
8
12
|
}
|
|
13
|
+
/**
|
|
14
|
+
* Check if the given code is a [Low-Surrogate Code Unit](https://www.unicode.org/glossary/#low_surrogate_code_unit).
|
|
15
|
+
*
|
|
16
|
+
* A Low-Surrogate Code Unit is the _second_ half of a [Surrogate Pair](https://www.unicode.org/glossary/#surrogate_pair).
|
|
17
|
+
* @param code
|
|
18
|
+
* The character code to check.
|
|
19
|
+
* @returns
|
|
20
|
+
* True if the code is a Low-Surrogate Code Unit, false otherwise.
|
|
21
|
+
*/
|
|
9
22
|
function isCodeLowSurrogate(code) {
|
|
10
|
-
|
|
23
|
+
return Boolean(code && code >= 56320 && code <= 57343);
|
|
11
24
|
}
|
|
25
|
+
/**
|
|
26
|
+
* If `code` is a [Low-Surrogate Code Unit](https://www.unicode.org/glossary/#low_surrogate_code_unit), try to get a genuine previous [Unicode Scalar Value](https://www.unicode.org/glossary/#unicode_scalar_value) corresponding to the Low-Surrogate Code Unit.
|
|
27
|
+
* @param code a tentative previous [code unit](https://www.unicode.org/glossary/#code_unit) less than 65,536, including a Low-Surrogate one
|
|
28
|
+
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
29
|
+
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
30
|
+
* @returns a value greater than 65,535 if the previous code point represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), or `code` otherwise
|
|
31
|
+
*/
|
|
12
32
|
function tryGetGenuinePreviousCode(code, nowPoint, sliceSerialize) {
|
|
13
|
-
|
|
14
|
-
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
return previousCandidate && previousCandidate >= 65536 ? previousCandidate : code;
|
|
33
|
+
if (nowPoint._bufferIndex < 2) return code;
|
|
34
|
+
const previousCandidate = sliceSerialize({
|
|
35
|
+
start: {
|
|
36
|
+
...nowPoint,
|
|
37
|
+
_bufferIndex: nowPoint._bufferIndex - 2
|
|
38
|
+
},
|
|
39
|
+
end: nowPoint
|
|
40
|
+
}).codePointAt(0);
|
|
41
|
+
return previousCandidate && previousCandidate >= 65536 ? previousCandidate : code;
|
|
23
42
|
}
|
|
43
|
+
/**
|
|
44
|
+
* Try to get the [Unicode Code Point](https://www.unicode.org/glossary/#code_point) two positions before the current position.
|
|
45
|
+
*
|
|
46
|
+
* @param previousCode a previous code point. Should be greater than 65,535 if it represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character).
|
|
47
|
+
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
48
|
+
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
49
|
+
* @returns a value greater than 65,535 if the code point two positions before represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), a value less than 65,536 for a [BMP Character](https://www.unicode.org/glossary/#bmp_character), or `null` if not found
|
|
50
|
+
*/
|
|
24
51
|
function tryGetCodeTwoBefore(previousCode, nowPoint, sliceSerialize) {
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
|
|
33
|
-
|
|
34
|
-
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
|
|
38
|
-
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
44
|
-
if (Number.isNaN(twoPreviousLast)) {
|
|
45
|
-
return null;
|
|
46
|
-
}
|
|
47
|
-
if (twoPreviousBuffer.length < 2 || twoPreviousLast < 56320 || 57343 < twoPreviousLast) {
|
|
48
|
-
return twoPreviousLast;
|
|
49
|
-
}
|
|
50
|
-
const twoPreviousCandidate = twoPreviousBuffer.codePointAt(0);
|
|
51
|
-
if (twoPreviousCandidate && twoPreviousCandidate >= 65536) {
|
|
52
|
-
return twoPreviousCandidate;
|
|
53
|
-
}
|
|
54
|
-
return twoPreviousLast;
|
|
52
|
+
const previousWidth = previousCode >= 65536 ? 2 : 1;
|
|
53
|
+
if (nowPoint._bufferIndex < 1 + previousWidth) return null;
|
|
54
|
+
const idealStart = nowPoint._bufferIndex - previousWidth - 2;
|
|
55
|
+
const twoPreviousBuffer = sliceSerialize({
|
|
56
|
+
start: {
|
|
57
|
+
...nowPoint,
|
|
58
|
+
_bufferIndex: idealStart >= 0 ? idealStart : 0
|
|
59
|
+
},
|
|
60
|
+
end: {
|
|
61
|
+
...nowPoint,
|
|
62
|
+
_bufferIndex: nowPoint._bufferIndex - previousWidth
|
|
63
|
+
}
|
|
64
|
+
});
|
|
65
|
+
const twoPreviousLast = twoPreviousBuffer.charCodeAt(twoPreviousBuffer.length - 1);
|
|
66
|
+
if (Number.isNaN(twoPreviousLast)) return null;
|
|
67
|
+
if (twoPreviousBuffer.length < 2 || twoPreviousLast < 56320 || 57343 < twoPreviousLast) return twoPreviousLast;
|
|
68
|
+
const twoPreviousCandidate = twoPreviousBuffer.codePointAt(0);
|
|
69
|
+
if (twoPreviousCandidate && twoPreviousCandidate >= 65536) return twoPreviousCandidate;
|
|
70
|
+
return twoPreviousLast;
|
|
55
71
|
}
|
|
72
|
+
/**
|
|
73
|
+
* Lazily get the [Unicode Code Point](https://www.unicode.org/glossary/#code_point) two positions before the current position only if necessary.
|
|
74
|
+
*
|
|
75
|
+
* @see {@link tryGetCodeTwoBefore}
|
|
76
|
+
*/
|
|
56
77
|
var TwoPreviousCode = class {
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
|
|
71
|
-
|
|
72
|
-
|
|
73
|
-
|
|
74
|
-
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
this.nowPoint,
|
|
84
|
-
this.sliceSerialize
|
|
85
|
-
);
|
|
86
|
-
}
|
|
87
|
-
return this.cachedValue;
|
|
88
|
-
}
|
|
78
|
+
cachedValue = void 0;
|
|
79
|
+
/**
|
|
80
|
+
* @see {@link tryGetCodeTwoBefore}
|
|
81
|
+
*
|
|
82
|
+
* @param previousCode a previous code point. Should be greater than 65,535 if it represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character).
|
|
83
|
+
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
84
|
+
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
85
|
+
*/
|
|
86
|
+
constructor(previousCode, nowPoint, sliceSerialize) {
|
|
87
|
+
this.previousCode = previousCode;
|
|
88
|
+
this.nowPoint = nowPoint;
|
|
89
|
+
this.sliceSerialize = sliceSerialize;
|
|
90
|
+
}
|
|
91
|
+
/**
|
|
92
|
+
* Returns the return value of {@link tryGetCodeTwoBefore}.
|
|
93
|
+
*
|
|
94
|
+
* If the value has not been computed yet, it will be computed and cached.
|
|
95
|
+
*
|
|
96
|
+
* @see {@link tryGetCodeTwoBefore}
|
|
97
|
+
*
|
|
98
|
+
* @returns a value greater than 65,535 if the code point two positions before represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), a value less than 65,536 for a [BMP Character](https://www.unicode.org/glossary/#bmp_character), or `null` if not found
|
|
99
|
+
*/
|
|
100
|
+
value() {
|
|
101
|
+
if (this.cachedValue === void 0) this.cachedValue = tryGetCodeTwoBefore(this.previousCode, this.nowPoint, this.sliceSerialize);
|
|
102
|
+
return this.cachedValue;
|
|
103
|
+
}
|
|
89
104
|
};
|
|
105
|
+
/**
|
|
106
|
+
* If `code` is a [High-Surrogate Code Unit](https://www.unicode.org/glossary/#high_surrogate_code_unit), try to get a genuine next [Unicode Scalar Value](https://www.unicode.org/glossary/#unicode_scalar_value) corresponding to the High-Surrogate Code Unit.
|
|
107
|
+
* @param code a tentative next [code unit](https://www.unicode.org/glossary/#code_unit) less than 65,536, including a High-Surrogate one
|
|
108
|
+
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
109
|
+
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
110
|
+
* @returns a value greater than 65,535 if the next code point represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), or `code` otherwise
|
|
111
|
+
*/
|
|
90
112
|
function tryGetGenuineNextCode(code, nowPoint, sliceSerialize) {
|
|
91
|
-
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
95
|
-
|
|
113
|
+
const nextCandidate = sliceSerialize({
|
|
114
|
+
start: nowPoint,
|
|
115
|
+
end: {
|
|
116
|
+
...nowPoint,
|
|
117
|
+
_bufferIndex: nowPoint._bufferIndex + 2
|
|
118
|
+
}
|
|
119
|
+
}).codePointAt(0);
|
|
120
|
+
return nextCandidate && nextCandidate >= 65536 ? nextCandidate : code;
|
|
96
121
|
}
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
|
|
100
|
-
isCodeLowSurrogate,
|
|
101
|
-
tryGetCodeTwoBefore,
|
|
102
|
-
tryGetGenuineNextCode,
|
|
103
|
-
tryGetGenuinePreviousCode
|
|
104
|
-
};
|
|
122
|
+
|
|
123
|
+
//#endregion
|
|
124
|
+
export { TwoPreviousCode, isCodeHighSurrogate, isCodeLowSurrogate, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode };
|
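Reviewer note: the lazy lookup in `TwoPreviousCode` can be checked in isolation. This is a sketch under stated assumptions. `tryGetCodeTwoBefore` and the class body are copied from the 3.0.0 `dist/codeUtil.js` in this diff (JSDoc removed). The counting `sliceSerialize` stub and the hand-built `nowPoint` are hypothetical and assume a single string chunk indexed by `_bufferIndex`.

```javascript
// Bodies copied from dist/codeUtil.js (3.0.0) as shown in this diff.
function tryGetCodeTwoBefore(previousCode, nowPoint, sliceSerialize) {
  const previousWidth = previousCode >= 65536 ? 2 : 1;
  if (nowPoint._bufferIndex < 1 + previousWidth) return null;
  const idealStart = nowPoint._bufferIndex - previousWidth - 2;
  const twoPreviousBuffer = sliceSerialize({
    start: { ...nowPoint, _bufferIndex: idealStart >= 0 ? idealStart : 0 },
    end: { ...nowPoint, _bufferIndex: nowPoint._bufferIndex - previousWidth }
  });
  const twoPreviousLast = twoPreviousBuffer.charCodeAt(twoPreviousBuffer.length - 1);
  if (Number.isNaN(twoPreviousLast)) return null;
  if (twoPreviousBuffer.length < 2 || twoPreviousLast < 56320 || 57343 < twoPreviousLast) return twoPreviousLast;
  const twoPreviousCandidate = twoPreviousBuffer.codePointAt(0);
  if (twoPreviousCandidate && twoPreviousCandidate >= 65536) return twoPreviousCandidate;
  return twoPreviousLast;
}

class TwoPreviousCode {
  cachedValue = void 0;
  constructor(previousCode, nowPoint, sliceSerialize) {
    this.previousCode = previousCode;
    this.nowPoint = nowPoint;
    this.sliceSerialize = sliceSerialize;
  }
  value() {
    if (this.cachedValue === void 0) this.cachedValue = tryGetCodeTwoBefore(this.previousCode, this.nowPoint, this.sliceSerialize);
    return this.cachedValue;
  }
}

// Hypothetical stub: counts how often the buffer is actually sliced.
let sliceCalls = 0;
const text = "\u{20BB7}\uFE00*"; // U+20BB7, then VS1 (U+FE00), then "*"
const sliceSerialize = ({ start, end }) => {
  sliceCalls += 1;
  return text.slice(start._bufferIndex, end._bufferIndex);
};

// Hypothetical Point: the tokenizer is positioned at "*" (code unit index 3).
const nowPoint = { line: 1, column: 4, offset: 3, _index: 0, _bufferIndex: 3 };

const lazy = new TwoPreviousCode(0xfe00, nowPoint, sliceSerialize);
const first = lazy.value(); // computed on first access
const second = lazy.value(); // served from the cache

console.log(first.toString(16), second === first, sliceCalls); // 20bb7 true 1
```

Because the cache sentinel is `undefined` rather than `null`, a `null` result ("not found") is also cached and not recomputed on later calls.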
package/dist/index.d.ts
CHANGED
|
@@ -1,5 +1,4 @@
|
|
|
1
|
-
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
import 'micromark-util-types';
|
|
1
|
+
import { classifyCharacter, classifyPrecedingCharacter, constantsEx } from "./classifyCharacter.js";
|
|
2
|
+
import { isCjk, isCjkOrIvs, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace } from "./categoryUtil.js";
|
|
3
|
+
import { TwoPreviousCode, isCodeHighSurrogate, isCodeLowSurrogate, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode } from "./codeUtil.js";
|
|
4
|
+
export { TwoPreviousCode, classifyCharacter, classifyPrecedingCharacter, constantsEx, isCjk, isCjkOrIvs, isCodeHighSurrogate, isCodeLowSurrogate, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode };
|
package/dist/index.js
CHANGED
|
@@ -1,231 +1,5 @@
|
|
|
1
|
-
|
|
2
|
-
|
|
3
|
-
|
|
1
|
+
import { classifyCharacter, classifyPrecedingCharacter, constantsEx } from "./classifyCharacter.js";
|
|
2
|
+
import { isCjk, isCjkOrIvs, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace } from "./categoryUtil.js";
|
|
3
|
+
import { TwoPreviousCode, isCodeHighSurrogate, isCodeLowSurrogate, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode } from "./codeUtil.js";
|
|
4
4
|
|
|
5
|
-
|
|
6
|
-
import { constants as constants2 } from "micromark-util-symbol";
|
|
7
|
-
|
|
8
|
-
// src/classifyCharacter.ts
|
|
9
|
-
import { markdownLineEndingOrSpace } from "micromark-util-character";
|
|
10
|
-
import { codes, constants } from "micromark-util-symbol";
|
|
11
|
-
|
|
12
|
-
// src/characterWithNonBmp.ts
|
|
13
|
-
import { eastAsianWidthType } from "get-east-asian-width";
|
|
14
|
-
function isEmoji(uc) {
|
|
15
|
-
return /^\p{Emoji_Presentation}/u.test(String.fromCodePoint(uc));
|
|
16
|
-
}
|
|
17
|
-
function cjkOrIvs(uc) {
|
|
18
|
-
if (!uc || uc < 4352) {
|
|
19
|
-
return false;
|
|
20
|
-
}
|
|
21
|
-
const eaw = eastAsianWidthType(uc);
|
|
22
|
-
switch (eaw) {
|
|
23
|
-
case "fullwidth":
|
|
24
|
-
case "halfwidth":
|
|
25
|
-
return true;
|
|
26
|
-
// never be emoji
|
|
27
|
-
case "wide":
|
|
28
|
-
return !isEmoji(uc);
|
|
29
|
-
case "narrow":
|
|
30
|
-
return false;
|
|
31
|
-
case "ambiguous":
|
|
32
|
-
return 917760 <= uc && uc <= 917999 ? null : false;
|
|
33
|
-
case "neutral":
|
|
34
|
-
return /^\p{sc=Hangul}/u.test(String.fromCodePoint(uc));
|
|
35
|
-
}
|
|
36
|
-
}
|
|
37
|
-
function isCjkAmbiguousPunctuation(main, vs) {
|
|
38
|
-
if (vs !== 65025 || !main || main < 8216) return false;
|
|
39
|
-
return main === 8216 || main === 8217 || main === 8220 || main === 8221;
|
|
40
|
-
}
|
|
41
|
-
function nonEmojiGeneralUseVS(code) {
|
|
42
|
-
return code !== null && code >= 65024 && code <= 65038;
|
|
43
|
-
}
|
|
44
|
-
var unicodePunctuation = regexCheck(/\p{P}|\p{S}/u);
|
|
45
|
-
var unicodeWhitespace = regexCheck(/\s/);
|
|
46
|
-
function regexCheck(regex) {
|
|
47
|
-
return check;
|
|
48
|
-
function check(code) {
|
|
49
|
-
return code !== null && code > -1 && regex.test(String.fromCodePoint(code));
|
|
50
|
-
}
|
|
51
|
-
}
|
|
52
|
-
|
|
53
|
-
// src/classifyCharacter.ts
|
|
54
|
-
var constantsEx;
|
|
55
|
-
((constantsEx2) => {
|
|
56
|
-
constantsEx2.spaceOrPunctuation = 3;
|
|
57
|
-
constantsEx2.cjk = 4096;
|
|
58
|
-
constantsEx2.cjkPunctuation = 4098;
|
|
59
|
-
constantsEx2.ivs = 8192;
|
|
60
|
-
constantsEx2.cjkOrIvs = 12288;
|
|
61
|
-
constantsEx2.nonEmojiGeneralUseVS = 16384;
|
|
62
|
-
constantsEx2.variationSelector = 24576;
|
|
63
|
-
constantsEx2.ivsToCjkRightShift = 1;
|
|
64
|
-
})(constantsEx || (constantsEx = {}));
|
|
65
|
-
function classifyCharacter(code) {
|
|
66
|
-
if (code === codes.eof || markdownLineEndingOrSpace(code) || unicodeWhitespace(code)) {
|
|
67
|
-
return constants.characterGroupWhitespace;
|
|
68
|
-
}
|
|
69
|
-
let value = 0;
|
|
70
|
-
if (code >= 4352) {
|
|
71
|
-
if (nonEmojiGeneralUseVS(code)) {
|
|
72
|
-
return constantsEx.nonEmojiGeneralUseVS;
|
|
73
|
-
}
|
|
74
|
-
switch (cjkOrIvs(code)) {
|
|
75
|
-
case null:
|
|
76
|
-
return constantsEx.ivs;
|
|
77
|
-
case true:
|
|
78
|
-
value |= constantsEx.cjk;
|
|
79
|
-
break;
|
|
80
|
-
}
|
|
81
|
-
}
|
|
82
|
-
if (unicodePunctuation(code)) {
|
|
83
|
-
value |= constants.characterGroupPunctuation;
|
|
84
|
-
}
|
|
85
|
-
return value;
|
|
86
|
-
}
|
|
87
|
-
function classifyPrecedingCharacter(before, get2Previous, previous) {
|
|
88
|
-
if (!isNonEmojiGeneralUseVS(before)) {
|
|
89
|
-
return before;
|
|
90
|
-
}
|
|
91
|
-
const twoPrevious = get2Previous();
|
|
92
|
-
const twoBefore = classifyCharacter(twoPrevious);
|
|
93
|
-
return !twoPrevious || isUnicodeWhitespace(twoBefore) ? before : isCjkAmbiguousPunctuation(twoPrevious, previous) ? constantsEx.cjkPunctuation : stripIvs(twoBefore);
|
|
94
|
-
}
|
|
95
|
-
function stripIvs(twoBefore) {
|
|
96
|
-
return twoBefore & ~constantsEx.ivs;
|
|
97
|
-
}
|
|
98
|
-
|
|
99
|
-
// src/categoryUtil.ts
|
|
100
|
-
function isUnicodeWhitespace(category) {
|
|
101
|
-
return Boolean(category & constants2.characterGroupWhitespace);
|
|
102
|
-
}
|
|
103
|
-
function isNonCjkPunctuation(category) {
|
|
104
|
-
return (category & constantsEx.cjkPunctuation) === constants2.characterGroupPunctuation;
|
|
105
|
-
}
|
|
106
|
-
function isCjk(category) {
|
|
107
|
-
return Boolean(category & constantsEx.cjk);
|
|
108
|
-
}
|
|
109
|
-
function isIvs(category) {
|
|
110
|
-
return category === constantsEx.ivs;
|
|
111
|
-
}
|
|
112
|
-
function isCjkOrIvs(category) {
|
|
113
|
-
return Boolean(category & constantsEx.cjkOrIvs);
|
|
114
|
-
}
|
|
115
|
-
function isNonEmojiGeneralUseVS(category) {
|
|
116
|
-
return category === constantsEx.nonEmojiGeneralUseVS;
|
|
117
|
-
}
|
|
118
|
-
function isSpaceOrPunctuation(category) {
|
|
119
|
-
return Boolean(category & constantsEx.spaceOrPunctuation);
|
|
120
|
-
}
|
|
121
|
-
|
|
122
|
-
// src/codeUtil.ts
|
|
123
|
-
function isCodeHighSurrogate(code) {
|
|
124
|
-
return Boolean(code && code >= 55296 && code <= 56319);
|
|
125
|
-
}
|
|
126
|
-
function isCodeLowSurrogate(code) {
|
|
127
|
-
return Boolean(code && code >= 56320 && code <= 57343);
|
|
128
|
-
}
|
|
129
|
-
function tryGetGenuinePreviousCode(code, nowPoint, sliceSerialize) {
|
|
130
|
-
if (nowPoint._bufferIndex < 2) {
|
|
131
|
-
return code;
|
|
132
|
-
}
|
|
133
|
-
const previousBuffer = sliceSerialize({
|
|
134
|
-
// take 2 characters (code units)
|
|
135
|
-
start: { ...nowPoint, _bufferIndex: nowPoint._bufferIndex - 2 },
|
|
136
|
-
end: nowPoint
|
|
137
|
-
});
|
|
138
|
-
const previousCandidate = previousBuffer.codePointAt(0);
|
|
139
|
-
return previousCandidate && previousCandidate >= 65536 ? previousCandidate : code;
|
|
140
|
-
}
|
|
141
|
-
function tryGetCodeTwoBefore(previousCode, nowPoint, sliceSerialize) {
|
|
142
|
-
const previousWidth = previousCode >= 65536 ? 2 : 1;
|
|
143
|
-
if (nowPoint._bufferIndex < 1 + previousWidth) {
|
|
144
|
-
return null;
|
|
145
|
-
}
|
|
146
|
-
const idealStart = nowPoint._bufferIndex - previousWidth - 2;
|
|
147
|
-
const twoPreviousBuffer = sliceSerialize({
|
|
148
|
-
// take 1--2 character
|
|
149
|
-
start: {
|
|
150
|
-
...nowPoint,
|
|
151
|
-
_bufferIndex: idealStart >= 0 ? idealStart : 0
|
|
152
|
-
},
|
|
153
|
-
end: {
|
|
154
|
-
...nowPoint,
|
|
155
|
-
_bufferIndex: nowPoint._bufferIndex - previousWidth
|
|
156
|
-
}
|
|
157
|
-
});
|
|
158
|
-
const twoPreviousLast = twoPreviousBuffer.charCodeAt(
|
|
159
|
-
twoPreviousBuffer.length - 1
|
|
160
|
-
);
|
|
161
|
-
if (Number.isNaN(twoPreviousLast)) {
|
|
162
|
-
return null;
|
|
163
|
-
}
|
|
164
|
-
if (twoPreviousBuffer.length < 2 || twoPreviousLast < 56320 || 57343 < twoPreviousLast) {
|
|
165
|
-
return twoPreviousLast;
|
|
166
|
-
}
|
|
167
|
-
const twoPreviousCandidate = twoPreviousBuffer.codePointAt(0);
|
|
168
|
-
if (twoPreviousCandidate && twoPreviousCandidate >= 65536) {
|
|
169
|
-
return twoPreviousCandidate;
|
|
170
|
-
}
|
|
171
|
-
return twoPreviousLast;
|
|
172
|
-
}
|
|
173
|
-
var TwoPreviousCode = class {
|
|
174
|
-
/**
|
|
175
|
-
* @see {@link tryGetCodeTwoBefore}
|
|
176
|
-
*
|
|
177
|
-
* @param previousCode a previous code point. Should be greater than 65,535 if it represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character).
|
|
178
|
-
* @param nowPoint `this.now()` (`this` = `TokenizeContext`)
|
|
179
|
-
* @param sliceSerialize `this.sliceSerialize` (`this` = `TokenizeContext`)
|
|
180
|
-
*/
|
|
181
|
-
constructor(previousCode, nowPoint, sliceSerialize) {
|
|
182
|
-
this.previousCode = previousCode;
|
|
183
|
-
this.nowPoint = nowPoint;
|
|
184
|
-
this.sliceSerialize = sliceSerialize;
|
|
185
|
-
__publicField(this, "cachedValue");
|
|
186
|
-
}
|
|
187
|
-
/**
|
|
188
|
-
* Returns the return value of {@link tryGetCodeTwoBefore}.
|
|
189
|
-
*
|
|
190
|
-
* If the value has not been computed yet, it will be computed and cached.
|
|
191
|
-
*
|
|
192
|
-
* @see {@link tryGetCodeTwoBefore}
|
|
193
|
-
*
|
|
194
|
-
* @returns a value greater than 65,535 if the code point two positions before represents a [Supplementary Character](https://www.unicode.org/glossary/#supplementary_character), a value less than 65,536 for a [BMP Character](https://www.unicode.org/glossary/#bmp_character), or `null` if not found
|
|
195
|
-
*/
|
|
196
|
-
value() {
|
|
197
|
-
if (this.cachedValue === void 0) {
|
|
198
|
-
this.cachedValue = tryGetCodeTwoBefore(
|
|
199
|
-
this.previousCode,
|
|
200
|
-
this.nowPoint,
|
|
201
|
-
this.sliceSerialize
|
|
202
|
-
);
|
|
203
|
-
}
|
|
204
|
-
return this.cachedValue;
|
|
205
|
-
}
|
|
206
|
-
};
|
|
207
|
-
function tryGetGenuineNextCode(code, nowPoint, sliceSerialize) {
|
|
208
|
-
const nextCandidate = sliceSerialize({
|
|
209
|
-
start: nowPoint,
|
|
210
|
-
end: { ...nowPoint, _bufferIndex: nowPoint._bufferIndex + 2 }
|
|
211
|
-
}).codePointAt(0);
|
|
212
|
-
return nextCandidate && nextCandidate >= 65536 ? nextCandidate : code;
|
|
213
|
-
}
|
|
214
|
-
export {
|
|
215
|
-
TwoPreviousCode,
|
|
216
|
-
classifyCharacter,
|
|
217
|
-
classifyPrecedingCharacter,
|
|
218
|
-
constantsEx,
|
|
219
|
-
isCjk,
|
|
220
|
-
isCjkOrIvs,
|
|
221
|
-
isCodeHighSurrogate,
|
|
222
|
-
isCodeLowSurrogate,
|
|
223
|
-
isIvs,
|
|
224
|
-
isNonCjkPunctuation,
|
|
225
|
-
isNonEmojiGeneralUseVS,
|
|
226
|
-
isSpaceOrPunctuation,
|
|
227
|
-
isUnicodeWhitespace,
|
|
228
|
-
tryGetCodeTwoBefore,
|
|
229
|
-
tryGetGenuineNextCode,
|
|
230
|
-
tryGetGenuinePreviousCode
|
|
231
|
-
};
|
|
5
|
+
export { TwoPreviousCode, classifyCharacter, classifyPrecedingCharacter, constantsEx, isCjk, isCjkOrIvs, isCodeHighSurrogate, isCodeLowSurrogate, isIvs, isNonCjkPunctuation, isNonEmojiGeneralUseVS, isSpaceOrPunctuation, isUnicodeWhitespace, tryGetCodeTwoBefore, tryGetGenuineNextCode, tryGetGenuinePreviousCode };
|
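Reviewer note: the category predicates removed from the 2.1.0 bundle above are plain bit tests. The sketch below reproduces a few of them to show how the flags compose. The `constantsEx` values are copied from the removed 2.1.0 code in this diff, and the split 3.0.0 modules may differ. The values `characterGroupWhitespace = 1` and `characterGroupPunctuation = 2` are an assumption about `micromark-util-symbol`, chosen because they are consistent with `spaceOrPunctuation = 3` and `cjkPunctuation = 4098`.

```javascript
// Values copied from the removed 2.1.0 bundle shown in this diff.
const constantsEx = {
  spaceOrPunctuation: 3,
  cjk: 4096,
  cjkPunctuation: 4098,
  ivs: 8192,
  cjkOrIvs: 12288,
  nonEmojiGeneralUseVS: 16384,
  variationSelector: 24576,
  ivsToCjkRightShift: 1
};

// Assumption: these mirror constants from micromark-util-symbol.
const characterGroupWhitespace = 1;
const characterGroupPunctuation = 2;

// Predicates copied from the removed 2.1.0 bundle, written as arrow functions.
const isUnicodeWhitespace = (category) => Boolean(category & characterGroupWhitespace);
const isNonCjkPunctuation = (category) =>
  (category & constantsEx.cjkPunctuation) === characterGroupPunctuation;
const isCjk = (category) => Boolean(category & constantsEx.cjk);
const isCjkOrIvs = (category) => Boolean(category & constantsEx.cjkOrIvs);
const isSpaceOrPunctuation = (category) => Boolean(category & constantsEx.spaceOrPunctuation);

// A CJK punctuation character carries both the cjk bit and the punctuation bit.
const cjkPunctuation = constantsEx.cjk | characterGroupPunctuation;
// An ASCII punctuation character carries only the punctuation bit.
const asciiPunctuation = characterGroupPunctuation;

const results = {
  composed: cjkPunctuation === constantsEx.cjkPunctuation,
  cjkPunctIsCjk: isCjk(cjkPunctuation),
  cjkPunctIsNonCjkPunct: isNonCjkPunctuation(cjkPunctuation),
  asciiPunctIsNonCjkPunct: isNonCjkPunctuation(asciiPunctuation),
  ivsIsCjk: isCjk(constantsEx.ivs),
  ivsIsCjkOrIvs: isCjkOrIvs(constantsEx.ivs),
  ivsShiftedIsCjk: isCjk(constantsEx.ivs >> constantsEx.ivsToCjkRightShift),
  whitespaceIsSpaceOrPunct: isSpaceOrPunctuation(characterGroupWhitespace),
  whitespaceIsWhitespace: isUnicodeWhitespace(characterGroupWhitespace)
};

console.log(results);
```

Under these assumptions, `isNonCjkPunctuation` is true only when the punctuation bit is set and the `cjk` bit is clear, which is why it compares against the punctuation constant instead of testing for a non-zero result.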
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "micromark-extension-cjk-friendly-util",
|
|
3
|
-
"version": "
|
|
3
|
+
"version": "3.0.0",
|
|
4
4
|
"type": "module",
|
|
5
5
|
"exports": {
|
|
6
6
|
".": {
|
|
@@ -33,7 +33,7 @@
|
|
|
33
33
|
"description": "common library for micromark-extension-cjk-friendly and its related packages",
|
|
34
34
|
"sideEffects": false,
|
|
35
35
|
"dependencies": {
|
|
36
|
-
"get-east-asian-width": "^1.
|
|
36
|
+
"get-east-asian-width": "^1.4.0",
|
|
37
37
|
"micromark-util-character": "^2.1.1",
|
|
38
38
|
"micromark-util-symbol": "^2.0.1"
|
|
39
39
|
},
|
|
@@ -46,15 +46,13 @@
|
|
|
46
46
|
}
|
|
47
47
|
},
|
|
48
48
|
"engines": {
|
|
49
|
-
"node": ">=
|
|
49
|
+
"node": ">=18"
|
|
50
50
|
},
|
|
51
51
|
"scripts": {
|
|
52
|
-
"build
|
|
53
|
-
"build": "
|
|
54
|
-
"
|
|
55
|
-
"dev:
|
|
56
|
-
"dev": "tsup --watch",
|
|
57
|
-
"dev:lib": "tsup --watch",
|
|
52
|
+
"build": "tsdown",
|
|
53
|
+
"build:lib": "tsdown",
|
|
54
|
+
"dev": "tsdown --watch",
|
|
55
|
+
"dev:lib": "tsdown --watch",
|
|
58
56
|
"test": "vitest run",
|
|
59
57
|
"test:lib": "vitest run",
|
|
60
58
|
"test:up": "vitest -u",
|