UniTok 3.0.11.tar.gz → 3.0.13.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. UniTok-3.0.13/PKG-INFO +142 -0
  2. UniTok-3.0.13/README.md +128 -0
  3. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py +3 -3
  4. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py +12 -0
  5. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/__init__.py +1 -1
  6. UniTok-3.0.11/UniTok/tok/entity_tok.py → UniTok-3.0.13/UniTok/tok/ent_tok.py +1 -1
  7. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py +22 -1
  8. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py +15 -2
  9. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py +20 -8
  10. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocabs.py +1 -1
  11. UniTok-3.0.13/UniTok.egg-info/PKG-INFO +142 -0
  12. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/SOURCES.txt +1 -1
  13. {UniTok-3.0.11 → UniTok-3.0.13}/setup.py +1 -1
  14. UniTok-3.0.11/PKG-INFO +0 -208
  15. UniTok-3.0.11/README.md +0 -194
  16. UniTok-3.0.11/UniTok.egg-info/PKG-INFO +0 -208
  17. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/__init__.py +0 -0
  18. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/__init__.py +0 -0
  19. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/lengths.py +0 -0
  20. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/plot.py +0 -0
  21. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/cols.py +0 -0
  22. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/global_setting.py +0 -0
  23. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/bert_tok.py +0 -0
  24. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/id_tok.py +0 -0
  25. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/number_tok.py +0 -0
  26. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/seq_tok.py +0 -0
  27. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/split_tok.py +0 -0
  28. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/tok.py +0 -0
  29. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/dependency_links.txt +0 -0
  30. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/requires.txt +0 -0
  31. {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/top_level.txt +0 -0
  32. {UniTok-3.0.11 → UniTok-3.0.13}/setup.cfg +0 -0
UniTok-3.0.13/PKG-INFO ADDED
@@ -0,0 +1,142 @@
1
+ Metadata-Version: 2.1
2
+ Name: UniTok
3
+ Version: 3.0.13
4
+ Summary: Unified Tokenizer
5
+ Home-page: https://github.com/Jyonn/UnifiedTokenizer
6
+ Author: Jyonn Liu
7
+ Author-email: i@6-79.cn
8
+ License: MIT Licence
9
+ Keywords: token,tokenizer,bert
10
+ Platform: any
11
+ Description-Content-Type: text/markdown
12
+
13
+ # UniTok V3.0
14
+
15
+ ## Introduction
16
+
17
+ UniTok is a unified text data preprocessing tool for machine learning. It provides a series of predefined tokenizers for handling different types of text data. UniTok is easy to pick up, greatly reducing the difficulty of data preprocessing and letting algorithm engineers focus on the algorithm itself.
18
+ UniDep is a parser for the data preprocessed by UniTok, and it can be used together with PyTorch's Dataset class (a short illustrative sketch follows this file's diff).
19
+
20
+ ## Installation
21
+
22
+ `pip install unitok>=3.0.11`
23
+
24
+ ## Usage
25
+
26
+ Take a news recommendation scenario as an example; the dataset may contain the following parts:
27
+
28
+ - News content data `(news.tsv)`: each line is one news item, with features such as news ID, title, abstract, category, and subcategory, separated by `\t`.
29
+ - User history data `(user.tsv)`: each line is one user, containing the user ID and the list of news IDs the user has clicked, with news IDs separated by ` `.
30
+ - Interaction data: training `(train.tsv)`, validation `(dev.tsv)`, and test `(test.tsv)` data. Each line is one interaction record containing the user ID, news ID, and the click label, separated by `\t`.
31
+
32
+ We first analyze the data type of each attribute above:
33
+
34
+ | File      | Attribute | Type | Example                                                               | Notes                                               |
35
+ |-----------|-----------|------|-----------------------------------------------------------------------|-----------------------------------------------------|
36
+ | news.tsv  | nid       | str  | N1234                                                                 | News ID, unique identifier                          |
37
+ | news.tsv  | title     | str  | After 10 years, the iPhone is still the best smartphone in the world | News title, usually tokenized with BertTokenizer    |
38
+ | news.tsv  | abstract  | str  | The iPhone 11 Pro is the best smartphone you can buy right now.      | News abstract, usually tokenized with BertTokenizer |
39
+ | news.tsv  | category  | str  | Technology                                                            | News category, indivisible                          |
40
+ | news.tsv  | subcat    | str  | Mobile                                                                | News subcategory, indivisible                       |
41
+ | user.tsv  | uid       | str  | U1234                                                                 | User ID, unique identifier                          |
42
+ | user.tsv  | history   | str  | N1234 N1235 N1236                                                     | User history, separated by ` `                      |
43
+ | train.tsv | uid       | str  | U1234                                                                 | User ID, consistent with `user.tsv`                 |
44
+ | train.tsv | nid       | str  | N1234                                                                 | News ID, consistent with `news.tsv`                 |
45
+ | train.tsv | label     | int  | 1                                                                     | Whether clicked: 0 = not clicked, 1 = clicked       |
46
+
47
+ We can group these attributes by the preset tokenizer to use:
48
+
49
+ | Attribute        | Type | Preset tokenizer | Notes                                  |
50
+ |------------------|------|------------------|----------------------------------------|
51
+ | nid, uid, index  | str  | IdTok            | Unique identifier                      |
52
+ | title, abstract  | str  | BertTok          | Set `vocab_dir="bert-base-uncased"`    |
53
+ | category, subcat | str  | EntTok           | Indivisible                            |
54
+ | history          | str  | SplitTok         | Set `sep=' '`                          |
55
+ | label            | int  | NumberTok        | Set `vocab_size=2`; only 0 and 1 occur |
56
+
57
+ With the following code, we can build a UniTok object for each data file:
58
+
59
+ ```python
60
+ from UniTok import UniTok, Column, Vocab
61
+ from UniTok.tok import IdTok, BertTok, EntTok, SplitTok, NumberTok
62
+
63
+ nid_vocab = Vocab('nid')  # shared by the news, history, and interaction data
64
+ eng_tok = BertTok(vocab_dir='bert-base-uncased', name='eng')  # for English text
65
+
66
+ news_ut = UniTok().add_col(Column(
67
+ name='nid',  # column name; can be omitted when it matches the tok's name
68
+ tok=IdTok(vocab=nid_vocab),  # the tokenizer; each UniTok object must have exactly one IdTok
69
+ )).add_col(Column(
70
+ name='title',
71
+ tok=eng_tok,
72
+ max_length=20,  # maximum length; anything longer is truncated
73
+ )).add_col(Column(
74
+ name='abstract',
75
+ tok=eng_tok,  # the abstract shares the same tokenizer as the title
76
+ max_length=30,  # maximum length; anything longer is truncated
77
+ )).add_col(Column(
78
+ name='category',
79
+ tok=EntTok(name='cat'),  # no explicit Vocab; one is created automatically from the name
80
+ )).add_col(Column(
81
+ name='subcat',
82
+ tok=EntTok(name='subcat'),
83
+ ))
84
+
85
+ news_ut.read('news.tsv', sep='\t')  # read the data file
86
+ news_ut.tokenize()  # run tokenization
87
+ news_ut.store('data/news')  # store the tokenized data
88
+
89
+ uid_vocab = Vocab('uid')  # shared by the user and interaction data
90
+
91
+ user_ut = UniTok().add_col(Column(
92
+ name='uid',
93
+ tok=IdTok(vocab=uid_vocab),
94
+ )).add_col(Column(
95
+ name='history',
96
+ tok=SplitTok(sep=' '),  # news IDs in the history are separated by spaces
97
+ ))
98
+
99
+ user_ut.read('user.tsv', sep='\t')  # read the data file
100
+ user_ut.tokenize()  # run tokenization
101
+ user_ut.store('data/user')  # store the tokenized data
102
+
103
+
104
+ def inter_tokenize(mode):
105
+ # Since train/dev/test have different indices, the UniTok object must be rebuilt before each run.
106
+ # Otherwise the index vocab may be inaccurate, leaving the metadata inconsistent with the real data.
107
+ # Parsing the data with UniDep afterwards can correct the index error.
108
+
109
+ inter_ut = UniTok().add_index_col(
110
+ # The index column of the interaction data is generated automatically; no tokenizer is needed.
111
+ ).add_col(Column(
112
+ name='uid',
113
+ tok=IdTok(vocab=uid_vocab),  # keep consistent with the uid column in user_ut
114
+ )).add_col(Column(
115
+ name='nid',
116
+ tok=IdTok(vocab=nid_vocab),  # keep consistent with the nid column in news_ut
117
+ )).add_col(Column(
118
+ name='label',
119
+ tok=NumberTok(vocab_size=2),  # only 0 and 1 occur; supported since version 3.0.11
120
+ ))
121
+
122
+ inter_ut.read(f'{mode}.tsv', sep='\t')  # read the data file
123
+ inter_ut.tokenize()  # run tokenization
124
+ inter_ut.store(mode)  # store the tokenized data
125
+
126
+
127
+ inter_tokenize('data/train')
128
+ inter_tokenize('data/dev')
129
+ inter_tokenize('data/test')
130
+ ```
131
+
132
+ We can then parse the data with UniDep:
133
+
134
+ ```python
135
+ from UniTok import UniDep
136
+
137
+ news_dep = UniDep('data/news')  # load the tokenized data
138
+ print(len(news_dep))
139
+ print(news_dep[0])
140
+ ```
141
+
142
+
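The description above notes that UniDep, the parser for UniTok output, can work together with PyTorch's `Dataset` class. Below is a minimal illustrative sketch of that pairing, not part of the package: it assumes the `data/news` store produced by the example and only the `UniDep` behaviour visible in this diff (`len()` and index access); the `NewsDataset` wrapper name is hypothetical.

```python
# Minimal sketch (not part of UniTok): wrapping a UniDep store in a PyTorch Dataset.
from torch.utils.data import Dataset

from UniTok import UniDep


class NewsDataset(Dataset):  # hypothetical wrapper class, for illustration only
    def __init__(self, store_dir: str):
        self.depot = UniDep(store_dir)  # load the tokenized store, e.g. 'data/news'

    def __len__(self):
        return len(self.depot)  # UniDep reports its sample size via len()

    def __getitem__(self, index):
        return self.depot[index]  # one tokenized sample, addressed by index


dataset = NewsDataset('data/news')
print(len(dataset))
print(dataset[0])
```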
UniTok-3.0.13/README.md ADDED
@@ -0,0 +1,128 @@
(128 added lines, identical to the package description in UniTok-3.0.13/PKG-INFO above)
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py
@@ -42,9 +42,9 @@ class Column:
42
42
  tok (BaseTok): The tokenizer of the column.
43
43
  operator (SeqOperator): The operator of the column.
44
44
  """
45
- def __init__(self, name, tok: BaseTok, operator: SeqOperator = None, **kwargs):
46
- self.name = name
45
+ def __init__(self, tok: BaseTok, name=None, operator: SeqOperator = None, **kwargs):
47
46
  self.tok = tok
47
+ self.name = name or tok.vocab.name
48
48
  self.operator = operator
49
49
 
50
50
  if kwargs:
@@ -115,4 +115,4 @@ class Column:
115
115
 
116
116
  class IndexColumn(Column):
117
117
  def __init__(self, name='index'):
118
- super().__init__(name, tok=IdTok(name=name))
118
+ super().__init__(tok=IdTok(name=name))
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py
@@ -127,3 +127,15 @@ class Meta:
127
127
  print('Old meta data backed up to {}.'.format(self.path + '.bak'))
128
128
  self.save()
129
129
  print('Meta data upgraded.')
130
+
131
+ @property
132
+ def col_info(self):
133
+ warnings.warn('col_info is deprecated, use cols instead.'
134
+ '(meta.col_info -> meta.cols)', DeprecationWarning)
135
+ return self.cols
136
+
137
+ @property
138
+ def vocab_info(self):
139
+ warnings.warn('vocab_info is deprecated, use vocs instead.'
140
+ '(meta.vocab_info -> meta.vocs)', DeprecationWarning)
141
+ return self.vocs
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/__init__.py
@@ -1,5 +1,5 @@
1
1
  from .bert_tok import BertTok
2
- from .entity_tok import EntTok
2
+ from .ent_tok import EntTok
3
3
  from .id_tok import IdTok
4
4
  from .split_tok import SplitTok
5
5
  from .number_tok import NumberTok
UniTok-3.0.11/UniTok/tok/entity_tok.py → UniTok-3.0.13/UniTok/tok/ent_tok.py
@@ -9,7 +9,7 @@ class EntTok(BaseTok):
9
9
  ...: other arguments of the BaseTok
10
10
 
11
11
  Example:
12
- >>> from UniTok.tok.entity_tok import EntTok
12
+ >>> from UniTok.tok.ent_tok import EntTok
13
13
  >>> tok = EntTok(name='entity')
14
14
  >>> tok('JJ Lin') # 0
15
15
  >>> tok('Jay Chou') # 1
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py
@@ -104,7 +104,28 @@ class UniDep:
104
104
  return self.sample_size
105
105
 
106
106
  def __str__(self):
107
- return f'UniDep from {self.store_dir}'
107
+ """ UniDep (dir):
108
+
109
+ Sample Size: 1000
110
+ Id Column: id
111
+ Columns:
112
+ id, vocab index (size 1000)
113
+ text, vocab eng (size 30522), max length 100
114
+ label, vocab label (size 2)
115
+ """
116
+ introduction = f"""
117
+ UniDep ({self.meta.parse_version(self.meta.version)}): {self.store_dir}
118
+
119
+ Sample Size: {self.sample_size}
120
+ Id Column: {self.id_col}
121
+ Columns:\n"""
122
+
123
+ for col_name, col in self.cols.items(): # type: str, Col
124
+ introduction += f' \t{col_name}, vocab {col.voc.name} (size {col.voc.size})'
125
+ if col.max_length:
126
+ introduction += f', max length {col.max_length}'
127
+ introduction += '\n'
128
+ return introduction
108
129
 
109
130
  def __repr__(self):
110
131
  return str(self)
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py
@@ -9,7 +9,7 @@ import pandas as pd
9
9
  from .cols import Cols
10
10
  from .column import Column, IndexColumn
11
11
  from .tok.bert_tok import BertTok
12
- from .tok.entity_tok import EntTok
12
+ from .tok.ent_tok import EntTok
13
13
  from .tok.id_tok import IdTok
14
14
  from .vocab import Vocab
15
15
  from .vocabs import Vocabs
@@ -97,7 +97,7 @@ class UniTok:
97
97
  self.id_col = col
98
98
  return self
99
99
 
100
- def read_file(self, df, sep=None):
100
+ def read(self, df, sep=None):
101
101
  """
102
102
  Read data from a file
103
103
  """
@@ -107,6 +107,11 @@ class UniTok:
107
107
  self.data = df
108
108
  return self
109
109
 
110
+ def read_file(self, df, sep=None):
111
+ warnings.warn('read_file is deprecated, use read instead '
112
+ '(will be removed in 4.x version)', DeprecationWarning)
113
+ return self.read(df, sep)
114
+
110
115
  def __getitem__(self, col):
111
116
  """
112
117
  Get the data of a column
@@ -166,6 +171,14 @@ class UniTok:
166
171
  return self.cols[col_name].tok.vocab.get_store_path(store_dir)
167
172
 
168
173
  def store_data(self, store_dir):
174
+ """
175
+ Store the tokenized data
176
+ """
177
+ warnings.warn('unitok.store_data is deprecated, use store instead '
178
+ '(will be removed in 4.x version)', DeprecationWarning)
179
+ self.store(store_dir)
180
+
181
+ def store(self, store_dir):
169
182
  """
170
183
  Store the tokenized data
171
184
  """
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py
@@ -94,7 +94,7 @@ class Vocab:
94
94
  """
95
95
  set first n tokens as reserved tokens
96
96
  """
97
- if self.get_size():
97
+ if len(self):
98
98
  raise ValueError(f'vocab {self.name} is not empty, can not reserve tokens')
99
99
 
100
100
  self.reserved_tokens = tokens
@@ -107,17 +107,25 @@ class Vocab:
107
107
  return self
108
108
 
109
109
  def get_tokens(self):
110
- return [self.i2o[i] for i in range(len(self))]
110
+ warnings.warn('vocab.get_tokens is deprecated, '
111
+ 'use list(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
112
+ return list(self)
111
113
 
112
114
  def get_size(self):
113
- return len(self.i2o)
115
+ warnings.warn('vocab.get_size is deprecated, '
116
+ 'use len(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
117
+ return len(self)
114
118
 
115
119
  def __len__(self):
116
- return self.get_size()
120
+ return len(self.i2o)
117
121
 
118
122
  def __bool__(self):
119
123
  return True
120
124
 
125
+ def __iter__(self):
126
+ for i in range(len(self)):
127
+ yield self.i2o[i]
128
+
121
129
  """
122
130
  Editable Methods
123
131
  """
@@ -163,8 +171,8 @@ class Vocab:
163
171
  def save(self, store_dir):
164
172
  store_path = self.get_store_path(store_dir)
165
173
  with open(store_path, 'w') as f:
166
- for i in range(len(self.i2o)):
167
- f.write('{}\n'.format(self.i2o[i]))
174
+ for token in self:
175
+ f.write('{}\n'.format(token))
168
176
 
169
177
  return self
170
178
 
@@ -184,8 +192,10 @@ class Vocab:
184
192
  def trim(self, min_count=None, min_frequency=1):
185
193
  """
186
194
  trim vocab by min frequency
187
- :return:
195
+ :return: trimmed tokens
188
196
  """
197
+ _trimmed = []
198
+
189
199
  if min_count is None:
190
200
  warnings.warn('vocab.min_frequency is deprecated, '
191
201
  'use vocab.min_count instead (will be removed in 4.x version)', DeprecationWarning)
@@ -195,6 +205,8 @@ class Vocab:
195
205
  for index in self._counter:
196
206
  if self._counter[index] >= min_count:
197
207
  vocabs.append(self.i2o[index])
208
+ else:
209
+ _trimmed.append(self.i2o[index])
198
210
 
199
211
  self.i2o = dict()
200
212
  self.o2i = dict()
@@ -205,7 +217,7 @@ class Vocab:
205
217
  self.extend(vocabs)
206
218
 
207
219
  self._stable_mode = True
208
- return self
220
+ return _trimmed
209
221
 
210
222
  def summarize(self, base=10):
211
223
  """
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocabs.py
@@ -41,7 +41,7 @@ class Vocabs(dict):
41
41
  Get the information of all vocabs
42
42
  """
43
43
  return {vocab.name: dict(
44
- size=vocab.get_size(),
44
+ size=len(vocab),
45
45
  cols=self.cols[vocab.name],
46
46
  ) for vocab in self.values()}
47
47
 
UniTok-3.0.13/UniTok.egg-info/PKG-INFO ADDED
@@ -0,0 +1,142 @@
(142 added lines, identical to UniTok-3.0.13/PKG-INFO above)
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/SOURCES.txt
@@ -19,7 +19,7 @@ UniTok/analysis/lengths.py
19
19
  UniTok/analysis/plot.py
20
20
  UniTok/tok/__init__.py
21
21
  UniTok/tok/bert_tok.py
22
- UniTok/tok/entity_tok.py
22
+ UniTok/tok/ent_tok.py
23
23
  UniTok/tok/id_tok.py
24
24
  UniTok/tok/number_tok.py
25
25
  UniTok/tok/seq_tok.py
@@ -6,7 +6,7 @@ long_description = (this_directory / "README.md").read_text()
6
6
 
7
7
  setup(
8
8
  name='UniTok',
9
- version='3.0.11',
9
+ version='3.0.13',
10
10
  keywords=['token', 'tokenizer', 'bert'],
11
11
  description='Unified Tokenizer',
12
12
  long_description=long_description,
UniTok-3.0.11/PKG-INFO DELETED
@@ -1,208 +0,0 @@
1
- Metadata-Version: 2.1
2
- Name: UniTok
3
- Version: 3.0.11
4
- Summary: Unified Tokenizer
5
- Home-page: https://github.com/Jyonn/UnifiedTokenizer
6
- Author: Jyonn Liu
7
- Author-email: i@6-79.cn
8
- License: MIT Licence
9
- Keywords: token,tokenizer,bert
10
- Platform: any
11
- Description-Content-Type: text/markdown
12
-
13
- # Unified Tokenizer
14
-
15
- > Instructions for **3.0** version will be updated soon!
16
-
17
- ## Introduction
18
-
19
- Unified Tokenizer, shortly **UniTok**,
20
- offers pre-defined various tokenizers in dealing with textual data.
21
- It is a central data processing tool
22
- that allows algorithm engineers to focus more on the algorithm itself
23
- instead of tedious data preprocessing.
24
-
25
- It incorporates the BERT tokenizer from the [transformers]((https://github.com/huggingface/transformers)) library,
26
- while it supports custom via the general word segmentation module (i.e., the `BaseTok` class).
27
-
28
- ## Installation
29
-
30
- `pip install UnifiedTokenizer`
31
-
32
- ## Usage
33
-
34
- We use the head of the training set of the [MINDlarge](https://msnews.github.io/) dataset as an example (see `news-sample.tsv` file).
35
-
36
- ### Data Declaration (more info see [MIND GitHub](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md))
37
-
38
- Each line in the file is a piece of news, including 7 features, which are divided by the tab (`\t`) symbol:
39
-
40
- - News ID
41
- - Category
42
- - SubCategory
43
- - Title
44
- - Abstract
45
- - URL
46
- - Title Entities (entities contained in the title of this news)
47
- - Abstract Entities (entities contained in the abstract of this news)
48
-
49
- We only use its first 5 columns for demonstration.
50
-
51
- ### Pre-defined Tokenizers
52
-
53
- | Tokenizer | Description | Parameters |
54
- |-----------|-----------------------------------------------------------------------|-------------|
55
- | BertTok | Provided by the ``transformers` library, using the WordPiece strategy | `vocab_dir` |
56
- | EntTok | The column data is regarded as an entire token | / |
57
- | IdTok | A specific version of EntTok, required to be identical | / |
58
- | SplitTok | Tokens are joined by separators like tab, space | `sep` |
59
-
60
- ### Imports
61
-
62
- ```python
63
- import pandas as pd
64
-
65
-
66
- from UniTok import UniTok, Column
67
- from UniTok.tok import IdTok, EntTok, BertTok
68
- ```
69
-
70
- ### Read data
71
-
72
- ```python
73
- import pandas as pd
74
-
75
- df = pd.read_csv(
76
- filepath_or_buffer='path/news-sample.tsv',
77
- sep='\t',
78
- names=['nid', 'cat', 'subCat', 'title', 'abs', 'url', 'titEnt', 'absEnt'],
79
- usecols=['nid', 'cat', 'subCat', 'title', 'abs'],
80
- )
81
- ```
82
-
83
- ### Construct UniTok
84
-
85
- ```python
86
- cat_tok = EntTok(name='cat') # one tokenizer for both cat and subCat
87
- text_tok = BertTok(name='english', vocab_dir='bert-base-uncased') # specify the bert vocab
88
-
89
- unitok = UniTok().add_index_col(
90
- name='nid'
91
- ).add_col(Column(
92
- name='cat',
93
- tok=cat_tok
94
- )).add_col(Column(
95
- name='subCat',
96
- tok=cat_tok,
97
- )).add_col(Column(
98
- name='title',
99
- tok=text_tok,
100
- )).add_col(Column(
101
- name='abs',
102
- tok=text_tok,
103
- )).read_file(df)
104
- ```
105
-
106
- ### Analyse Data
107
-
108
- ```python
109
- unitok.analyse()
110
- ```
111
-
112
- It shows the distribution of the length of each column (if using _ListTokenizer_). It will help us determine the _max_length_ of the tokens for each column.
113
-
114
- ```
115
- [ COLUMNS ]
116
- [ COL: nid ]
117
- [NOT ListTokenizer]
118
-
119
- [ COL: cat ]
120
- [NOT ListTokenizer]
121
-
122
- [ COL: subCat ]
123
- [NOT ListTokenizer]
124
-
125
- [ COL: title ]
126
- [ MIN: 6 ]
127
- [ MAX: 16 ]
128
- [ AVG: 12 ]
129
- [ X-INT: 1 ]
130
- [ Y-INT: 0 ]
131
- |
132
- |
133
- |
134
- |
135
- || |
136
- || |
137
- || |
138
- | | | || |
139
- | | | || |
140
- | | | || |
141
- -----------
142
-
143
- [ COL: abs ]
144
- 100%|██████████| 10/10 [00:00<00:00, 119156.36it/s]
145
- 100%|██████████| 10/10 [00:00<00:00, 166440.63it/s]
146
- 100%|██████████| 10/10 [00:00<00:00, 164482.51it/s]
147
- 100%|██████████| 10/10 [00:00<00:00, 2172.09it/s]
148
- 100%|██████████| 10/10 [00:00<00:00, 1552.30it/s]
149
- [ MIN: 0 ]
150
- [ MAX: 46 ]
151
- [ AVG: 21 ]
152
- [ X-INT: 1 ]
153
- [ Y-INT: 0 ]
154
- |
155
- |
156
- |
157
- |
158
- |
159
- | | | || || | |
160
- | | | || || | |
161
- | | | || || | |
162
- | | | || || | |
163
- | | | || || | |
164
- -----------------------------------------------
165
-
166
- [ VOCABS ]
167
- [ VOC: news with 10 tokens ]
168
- [ COL: nid ]
169
-
170
- [ VOC: cat with 112 tokens ]
171
- [ COL: cat, subCat ]
172
-
173
- [ VOC: english with 30522 tokens ]
174
- [ COL: title, abs ]
175
- ```
176
-
177
- ### ReConstruct Unified Tokenizer
178
-
179
- ```python
180
- unitok = UniTok().add_index_col(
181
- name='nid'
182
- ).add_col(Column(
183
- name='cat',
184
- tok=cat_tok.as_sing()
185
- )).add_col(Column(
186
- name='subCat',
187
- tok=cat_tok.as_sing(),
188
- )).add_col(Column(
189
- name='title',
190
- tok=text_tok,
191
- operator=SeqOperator(max_length=20),
192
- )).add_col(Column(
193
- name='abs',
194
- tok=text_tok,
195
- operator=SeqOperator(max_length=30),
196
- )).read_file(df)
197
- ```
198
-
199
- In this step, we set _max_length_ of each column. If _max_length_ is not set, we will keep the **whole** sequence and not truncate it.
200
-
201
- ### Tokenize and Store
202
-
203
- ```python
204
- unitok.tokenize()
205
- unitok.store_data('TokenizedData')
206
- ```
207
-
208
-
UniTok-3.0.11/README.md DELETED
@@ -1,194 +0,0 @@
(194 removed lines, identical to the package description in UniTok-3.0.11/PKG-INFO above)
UniTok-3.0.11/UniTok.egg-info/PKG-INFO DELETED
@@ -1,208 +0,0 @@
(208 removed lines, identical to UniTok-3.0.11/PKG-INFO above)
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes