UniTok 3.0.11.tar.gz → 3.0.13.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- UniTok-3.0.13/PKG-INFO +142 -0
- UniTok-3.0.13/README.md +128 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py +3 -3
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py +12 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/__init__.py +1 -1
- UniTok-3.0.11/UniTok/tok/entity_tok.py → UniTok-3.0.13/UniTok/tok/ent_tok.py +1 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py +22 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py +15 -2
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py +20 -8
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocabs.py +1 -1
- UniTok-3.0.13/UniTok.egg-info/PKG-INFO +142 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/SOURCES.txt +1 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/setup.py +1 -1
- UniTok-3.0.11/PKG-INFO +0 -208
- UniTok-3.0.11/README.md +0 -194
- UniTok-3.0.11/UniTok.egg-info/PKG-INFO +0 -208
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/__init__.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/__init__.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/lengths.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/plot.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/cols.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/global_setting.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/bert_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/id_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/number_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/seq_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/split_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/dependency_links.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/requires.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/top_level.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/setup.cfg +0 -0
UniTok-3.0.13/PKG-INFO
ADDED
@@ -0,0 +1,142 @@

Metadata-Version: 2.1
Name: UniTok
Version: 3.0.13
Summary: Unified Tokenizer
Home-page: https://github.com/Jyonn/UnifiedTokenizer
Author: Jyonn Liu
Author-email: i@6-79.cn
License: MIT Licence
Keywords: token,tokenizer,bert
Platform: any
Description-Content-Type: text/markdown

# UniTok V3.0

## Introduction

UniTok is a unified text-data preprocessing tool for machine learning. It provides a set of predefined tokenizers for handling different kinds of textual data. UniTok is easy to pick up and greatly reduces the effort of data preprocessing, letting algorithm engineers focus on the algorithms themselves.

UniDep is the parsing tool for data preprocessed by UniTok, and it can be used together with PyTorch's Dataset class.

## Installation

`pip install unitok>=3.0.11`

## Usage

We take a news recommendation scenario as an example. The dataset may contain the following parts:

- News content data `(news.tsv)`: each line is one news article, with features such as news ID, title, abstract, category, and subcategory, separated by `\t`.
- User history data `(user.tsv)`: each line is one user, containing the user ID and the list of news IDs the user has clicked, with news IDs separated by ` `.
- Interaction data: training `(train.tsv)`, validation `(dev.tsv)`, and test `(test.tsv)` data. Each line is one interaction record, containing the user ID, the news ID, and whether the user clicked, separated by `\t`.

We first analyze the data type of each attribute:

| File      | Attribute | Type | Example                                                               | Notes                                                |
|-----------|-----------|------|-----------------------------------------------------------------------|------------------------------------------------------|
| news.tsv  | nid       | str  | N1234                                                                 | News ID, unique identifier                           |
| news.tsv  | title     | str  | After 10 years, the iPhone is still the best smartphone in the world  | News title, usually tokenized with BertTokenizer     |
| news.tsv  | abstract  | str  | The iPhone 11 Pro is the best smartphone you can buy right now.       | News abstract, usually tokenized with BertTokenizer  |
| news.tsv  | category  | str  | Technology                                                            | News category, not splittable                        |
| news.tsv  | subcat    | str  | Mobile                                                                | News subcategory, not splittable                     |
| user.tsv  | uid       | str  | U1234                                                                 | User ID, unique identifier                           |
| user.tsv  | history   | str  | N1234 N1235 N1236                                                     | User history, separated by ` `                       |
| train.tsv | uid       | str  | U1234                                                                 | User ID, consistent with `user.tsv`                  |
| train.tsv | nid       | str  | N1234                                                                 | News ID, consistent with `news.tsv`                  |
| train.tsv | label     | int  | 1                                                                     | Click label: 0 = not clicked, 1 = clicked            |

We can then group these attributes as follows:

| Attribute        | Type | Preset tokenizer | Notes                                    |
|------------------|------|------------------|------------------------------------------|
| nid, uid, index  | str  | IdTok            | Unique identifier                        |
| title, abstract  | str  | BertTok          | Pass `vocab_dir="bert-base-uncased"`     |
| category, subcat | str  | EntTok           | Not splittable                           |
| history          | str  | SplitTok         | Pass `sep=' '`                           |
| label            | int  | NumberTok        | Pass `vocab_size=2`; only 0 and 1 occur  |

With the following code, we construct one UniTok object per file:

```python
from UniTok import UniTok, Column, Vocab
from UniTok.tok import IdTok, BertTok, EntTok, SplitTok, NumberTok

nid_vocab = Vocab('nid')  # shared by the news, history, and interaction data
eng_tok = BertTok(vocab_dir='bert-base-uncased', name='eng')  # for English text

news_ut = UniTok().add_col(Column(
    name='nid',  # column name; can be omitted if it matches the tok's name
    tok=IdTok(vocab=nid_vocab),  # every UniTok object must have exactly one IdTok
)).add_col(Column(
    name='title',
    tok=eng_tok,
    max_length=20,  # maximum length; longer sequences are truncated
)).add_col(Column(
    name='abstract',
    tok=eng_tok,  # the abstract shares the title's tokenizer
    max_length=30,  # maximum length; longer sequences are truncated
)).add_col(Column(
    name='category',
    tok=EntTok(name='cat'),  # without an explicit Vocab, one is created automatically from the name
)).add_col(Column(
    name='subcat',
    tok=EntTok(name='subcat'),
))

news_ut.read('news.tsv', sep='\t')  # read the data file
news_ut.tokenize()  # run tokenization
news_ut.store('data/news')  # store the tokenized results

uid_vocab = Vocab('uid')  # shared by the user data and the interaction data

user_ut = UniTok().add_col(Column(
    name='uid',
    tok=IdTok(vocab=uid_vocab),
)).add_col(Column(
    name='history',
    tok=SplitTok(sep=' '),  # news IDs in the history are space-separated
))

user_ut.read('user.tsv', sep='\t')  # read the data file
user_ut.tokenize()  # run tokenization
user_ut.store('data/user')  # store the tokenized results


def inter_tokenize(mode):
    # Since train/dev/test have different indices, a fresh UniTok object must be built
    # before each run. Otherwise the index vocabulary may be inaccurate, making the
    # metadata inconsistent with the real data.
    # Parsing the data with UniDep afterwards corrects the index offset.

    inter_ut = UniTok().add_index_col(
        # the index column of the interaction data is generated automatically; no tokenizer is needed
    ).add_col(Column(
        name='uid',
        tok=IdTok(vocab=uid_vocab),  # consistent with the uid column in user_ut
    )).add_col(Column(
        name='nid',
        tok=IdTok(vocab=nid_vocab),  # consistent with the nid column in news_ut
    )).add_col(Column(
        name='label',
        tok=NumberTok(vocab_size=2),  # only 0 and 1; supported in versions >= 3.0.11
    ))

    inter_ut.read(f'{mode}.tsv', sep='\t')  # read the data file
    inter_ut.tokenize()  # run tokenization
    inter_ut.store(mode)  # store the tokenized results


inter_tokenize('data/train')
inter_tokenize('data/dev')
inter_tokenize('data/test')
```

We can then parse the data with UniDep:

```python
from UniTok import UniDep

news_dep = UniDep('data/news')  # load the tokenized results
print(len(news_dep))
print(news_dep[0])
```
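The introduction above notes that UniDep can be combined with PyTorch's `Dataset` class. As an editorial illustration only (not part of the package), a minimal wrapper could look like the following sketch; it assumes nothing beyond the `len()` and integer-indexing behaviour demonstrated above, and the class name `NewsDataset` and the `data/news` path are placeholders.

```python
# A minimal sketch of pairing UniDep with torch.utils.data.Dataset.
# Not part of UniTok itself; relies only on len(dep) and dep[i] as shown above.
from torch.utils.data import Dataset

from UniTok import UniDep


class NewsDataset(Dataset):  # hypothetical wrapper class
    def __init__(self, store_dir: str):
        self.dep = UniDep(store_dir)  # load the tokenized results

    def __len__(self):
        return len(self.dep)

    def __getitem__(self, index):
        return self.dep[index]  # one tokenized sample (a dict of columns)


# Usage (assuming the news data was stored as in the example above):
# dataset = NewsDataset('data/news')
# print(dataset[0])
```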
UniTok-3.0.13/README.md
ADDED
@@ -0,0 +1,128 @@

The added README.md is a verbatim copy of the `# UniTok V3.0` description embedded in UniTok-3.0.13/PKG-INFO above (the same 128 lines without the metadata header), so its content is not repeated here.
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py

@@ -42,9 +42,9 @@ class Column:
         tok (BaseTok): The tokenizer of the column.
         operator (SeqOperator): The operator of the column.
     """
-    def __init__(self,
-        self.name = name
+    def __init__(self, tok: BaseTok, name=None, operator: SeqOperator = None, **kwargs):
         self.tok = tok
+        self.name = name or tok.vocab.name
         self.operator = operator

         if kwargs:

@@ -115,4 +115,4 @@ class Column:

 class IndexColumn(Column):
     def __init__(self, name='index'):
-        super().__init__(
+        super().__init__(tok=IdTok(name=name))
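The constructor change above makes `name` optional, defaulting to the tokenizer's vocab name. A small illustrative sketch (editor's example, not from the package) of the shortened form:

```python
# Sketch of the new optional `name` argument on Column (3.0.13).
from UniTok import UniTok, Column
from UniTok.tok import EntTok

ut = UniTok().add_index_col().add_col(Column(
    tok=EntTok(name='category'),  # Column name defaults to the tok's vocab name, i.e. 'category'
))
# Equivalent to: Column(name='category', tok=EntTok(name='category'))
```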
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py

@@ -127,3 +127,15 @@ class Meta:
         print('Old meta data backed up to {}.'.format(self.path + '.bak'))
         self.save()
         print('Meta data upgraded.')
+
+    @property
+    def col_info(self):
+        warnings.warn('col_info is deprecated, use cols instead.'
+                      '(meta.col_info -> meta.cols)', DeprecationWarning)
+        return self.cols
+
+    @property
+    def vocab_info(self):
+        warnings.warn('vocab_info is deprecated, use vocs instead.'
+                      '(meta.vocab_info -> meta.vocs)', DeprecationWarning)
+        return self.vocs
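For downstream code, the deprecated properties keep old call sites working while pointing at the new names. A short migration sketch; note that the `dep.meta` access path is an assumption made for illustration and is not shown in this diff:

```python
# Migration sketch for the renamed Meta attributes (3.0.13).
from UniTok import UniDep

dep = UniDep('data/news')    # any stored UniTok output
cols = dep.meta.cols         # preferred name
cols = dep.meta.col_info     # still works, but emits DeprecationWarning
vocs = dep.meta.vocs         # preferred name
vocs = dep.meta.vocab_info   # still works, but emits DeprecationWarning
```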
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py

@@ -104,7 +104,28 @@ class UniDep:
         return self.sample_size

     def __str__(self):
-
+        """ UniDep (dir):
+
+        Sample Size: 1000
+        Id Column: id
+        Columns:
+            id, vocab index (size 1000)
+            text, vocab eng (size 30522), max length 100
+            label, vocab label (size 2)
+        """
+        introduction = f"""
+        UniDep ({self.meta.parse_version(self.meta.version)}): {self.store_dir}
+
+        Sample Size: {self.sample_size}
+        Id Column: {self.id_col}
+        Columns:\n"""
+
+        for col_name, col in self.cols.items():  # type: str, Col
+            introduction += f' \t{col_name}, vocab {col.voc.name} (size {col.voc.size})'
+            if col.max_length:
+                introduction += f', max length {col.max_length}'
+            introduction += '\n'
+        return introduction

     def __repr__(self):
         return str(self)
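With the new `__str__` (and the existing `__repr__` delegating to it), printing a loaded dataset yields the summary shown in the docstring above, for example:

```python
# Quick self-description of a stored dataset; output format per the docstring above.
from UniTok import UniDep

print(UniDep('data/news'))  # version, store dir, sample size, id column, per-column vocab sizes
```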
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py

@@ -9,7 +9,7 @@ import pandas as pd
 from .cols import Cols
 from .column import Column, IndexColumn
 from .tok.bert_tok import BertTok
-from .tok.
+from .tok.ent_tok import EntTok
 from .tok.id_tok import IdTok
 from .vocab import Vocab
 from .vocabs import Vocabs

@@ -97,7 +97,7 @@ class UniTok:
         self.id_col = col
         return self

-    def
+    def read(self, df, sep=None):
         """
         Read data from a file
         """

@@ -107,6 +107,11 @@ class UniTok:
         self.data = df
         return self

+    def read_file(self, df, sep=None):
+        warnings.warn('read_file is deprecated, use read instead '
+                      '(will be removed in 4.x version)', DeprecationWarning)
+        return self.read(df, sep)
+
     def __getitem__(self, col):
         """
         Get the data of a column

@@ -166,6 +171,14 @@ class UniTok:
         return self.cols[col_name].tok.vocab.get_store_path(store_dir)

     def store_data(self, store_dir):
+        """
+        Store the tokenized data
+        """
+        warnings.warn('unitok.store_data is deprecated, use store instead '
+                      '(will be removed in 4.x version)', DeprecationWarning)
+        self.store(store_dir)
+
+    def store(self, store_dir):
         """
         Store the tokenized data
         """
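In user code only the method names change; a minimal before/after sketch (file paths are placeholders taken from the README example):

```python
# Migration sketch for the renamed UniTok entry points (3.0.13).
# The old names remain as deprecated wrappers until 4.x.
from UniTok import UniTok, Column
from UniTok.tok import EntTok

ut = UniTok().add_index_col().add_col(Column(name='category', tok=EntTok(name='cat')))

ut.read('news.tsv', sep='\t')         # preferred since 3.0.13
# ut.read_file('news.tsv', sep='\t')  # deprecated alias, emits DeprecationWarning

ut.tokenize()

ut.store('data/news')                 # preferred since 3.0.13
# ut.store_data('data/news')          # deprecated alias, emits DeprecationWarning
```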
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py

@@ -94,7 +94,7 @@ class Vocab:
         """
         set first n tokens as reserved tokens
         """
-        if self
+        if len(self):
             raise ValueError(f'vocab {self.name} is not empty, can not reserve tokens')

         self.reserved_tokens = tokens

@@ -107,17 +107,25 @@ class Vocab:
         return self

     def get_tokens(self):
-
+        warnings.warn('vocab.get_tokens is deprecated, '
+                      'use list(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
+        return list(self)

     def get_size(self):
-
+        warnings.warn('vocab.get_size is deprecated, '
+                      'use len(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
+        return len(self)

     def __len__(self):
-        return self.
+        return len(self.i2o)

     def __bool__(self):
         return True

+    def __iter__(self):
+        for i in range(len(self)):
+            yield self.i2o[i]
+
     """
     Editable Methods
     """

@@ -163,8 +171,8 @@ class Vocab:
     def save(self, store_dir):
         store_path = self.get_store_path(store_dir)
         with open(store_path, 'w') as f:
-            for
-            f.write('{}\n'.format(
+            for token in self:
+                f.write('{}\n'.format(token))

         return self

@@ -184,8 +192,10 @@ class Vocab:
     def trim(self, min_count=None, min_frequency=1):
         """
         trim vocab by min frequency
-        :return:
+        :return: trimmed tokens
         """
+        _trimmed = []
+
         if min_count is None:
             warnings.warn('vocab.min_frequency is deprecated, '
                           'use vocab.min_count instead (will be removed in 4.x version)', DeprecationWarning)

@@ -195,6 +205,8 @@ class Vocab:
         for index in self._counter:
             if self._counter[index] >= min_count:
                 vocabs.append(self.i2o[index])
+            else:
+                _trimmed.append(self.i2o[index])

         self.i2o = dict()
         self.o2i = dict()

@@ -205,7 +217,7 @@ class Vocab:
         self.extend(vocabs)

         self._stable_mode = True
-        return
+        return _trimmed

     def summarize(self, base=10):
         """
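The new `__len__`/`__iter__` protocol replaces the deprecated getters, and `trim` now reports what it dropped. An illustrative sketch; the `extend` call is assumed to be public because `trim` uses it internally, and the token values are placeholders:

```python
# Sketch of the updated Vocab protocol (3.0.13).
from UniTok import Vocab

voc = Vocab('cat')
voc.extend(['news', 'sports', 'finance'])  # assumed public; trim() uses it internally

print(len(voc))    # preferred over the deprecated voc.get_size()
print(list(voc))   # preferred over the deprecated voc.get_tokens()

for token in voc:  # __iter__ yields tokens in index order
    print(token)

# trim() now returns the tokens it removed instead of None:
# dropped = voc.trim(min_count=5)
```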
UniTok-3.0.13/UniTok.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,142 @@

The added egg-info copy is identical to UniTok-3.0.13/PKG-INFO above (142 lines), so its content is not repeated here.
UniTok-3.0.11/PKG-INFO
DELETED
@@ -1,208 +0,0 @@

Metadata-Version: 2.1
Name: UniTok
Version: 3.0.11
Summary: Unified Tokenizer
Home-page: https://github.com/Jyonn/UnifiedTokenizer
Author: Jyonn Liu
Author-email: i@6-79.cn
License: MIT Licence
Keywords: token,tokenizer,bert
Platform: any
Description-Content-Type: text/markdown

# Unified Tokenizer

> Instructions for the **3.0** version will be updated soon!

## Introduction

Unified Tokenizer, shortly **UniTok**, offers various pre-defined tokenizers for dealing with textual data. It is a central data processing tool that allows algorithm engineers to focus on the algorithm itself instead of tedious data preprocessing.

It incorporates the BERT tokenizer from the [transformers](https://github.com/huggingface/transformers) library, and supports customization via the general word segmentation module (i.e., the `BaseTok` class).

## Installation

`pip install UnifiedTokenizer`

## Usage

We use the head of the training set of the [MINDlarge](https://msnews.github.io/) dataset as an example (see the `news-sample.tsv` file).

### Data Declaration (more info: [MIND GitHub](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md))

Each line in the file is a piece of news with the following features, divided by the tab (`\t`) symbol:

- News ID
- Category
- SubCategory
- Title
- Abstract
- URL
- Title Entities (entities contained in the title of this news)
- Abstract Entities (entities contained in the abstract of this news)

We only use the first 5 columns for demonstration.

### Pre-defined Tokenizers

| Tokenizer | Description                                                           | Parameters  |
|-----------|-----------------------------------------------------------------------|-------------|
| BertTok   | Provided by the `transformers` library, using the WordPiece strategy  | `vocab_dir` |
| EntTok    | The column data is regarded as an entire token                        | /           |
| IdTok     | A specific version of EntTok, required to be identical                | /           |
| SplitTok  | Tokens are joined by separators like tab or space                     | `sep`       |

### Imports

```python
import pandas as pd

from UniTok import UniTok, Column
from UniTok.tok import IdTok, EntTok, BertTok
```

### Read data

```python
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='path/news-sample.tsv',
    sep='\t',
    names=['nid', 'cat', 'subCat', 'title', 'abs', 'url', 'titEnt', 'absEnt'],
    usecols=['nid', 'cat', 'subCat', 'title', 'abs'],
)
```

### Construct UniTok

```python
cat_tok = EntTok(name='cat')  # one tokenizer for both cat and subCat
text_tok = BertTok(name='english', vocab_dir='bert-base-uncased')  # specify the bert vocab

unitok = UniTok().add_index_col(
    name='nid'
).add_col(Column(
    name='cat',
    tok=cat_tok
)).add_col(Column(
    name='subCat',
    tok=cat_tok,
)).add_col(Column(
    name='title',
    tok=text_tok,
)).add_col(Column(
    name='abs',
    tok=text_tok,
)).read_file(df)
```

### Analyse Data

```python
unitok.analyse()
```

It shows the distribution of the length of each column (if using _ListTokenizer_), which helps us determine the _max_length_ of the tokens for each column.

```
[ COLUMNS ]
[ COL: nid ]
[NOT ListTokenizer]

[ COL: cat ]
[NOT ListTokenizer]

[ COL: subCat ]
[NOT ListTokenizer]

[ COL: title ]
[ MIN: 6 ]
[ MAX: 16 ]
[ AVG: 12 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
(ASCII histogram of title lengths omitted)

[ COL: abs ]
100%|██████████| 10/10 [00:00<00:00, 119156.36it/s]
100%|██████████| 10/10 [00:00<00:00, 166440.63it/s]
100%|██████████| 10/10 [00:00<00:00, 164482.51it/s]
100%|██████████| 10/10 [00:00<00:00, 2172.09it/s]
100%|██████████| 10/10 [00:00<00:00, 1552.30it/s]
[ MIN: 0 ]
[ MAX: 46 ]
[ AVG: 21 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
(ASCII histogram of abstract lengths omitted)

[ VOCABS ]
[ VOC: news with 10 tokens ]
[ COL: nid ]

[ VOC: cat with 112 tokens ]
[ COL: cat, subCat ]

[ VOC: english with 30522 tokens ]
[ COL: title, abs ]
```

### ReConstruct Unified Tokenizer

```python
unitok = UniTok().add_index_col(
    name='nid'
).add_col(Column(
    name='cat',
    tok=cat_tok.as_sing()
)).add_col(Column(
    name='subCat',
    tok=cat_tok.as_sing(),
)).add_col(Column(
    name='title',
    tok=text_tok,
    operator=SeqOperator(max_length=20),
)).add_col(Column(
    name='abs',
    tok=text_tok,
    operator=SeqOperator(max_length=30),
)).read_file(df)
```

In this step, we set the _max_length_ of each column. If _max_length_ is not set, the **whole** sequence is kept without truncation.

### Tokenize and Store

```python
unitok.tokenize()
unitok.store_data('TokenizedData')
```
UniTok-3.0.11/README.md
DELETED
@@ -1,194 +0,0 @@

The removed README.md was a verbatim copy of the description embedded in UniTok-3.0.11/PKG-INFO above (the same 194 lines without the metadata header), so its content is not repeated here.
UniTok-3.0.11/UniTok.egg-info/PKG-INFO
DELETED
@@ -1,208 +0,0 @@

The removed egg-info copy was identical to UniTok-3.0.11/PKG-INFO above (208 lines), so its content is not repeated here.
The remaining 16 files listed above with +0 -0 (from {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/__init__.py through setup.cfg) are unchanged between the two versions.