UniTok 3.0.11.tar.gz → 3.0.13.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- UniTok-3.0.13/PKG-INFO +142 -0
- UniTok-3.0.13/README.md +128 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py +3 -3
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py +12 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/__init__.py +1 -1
- UniTok-3.0.11/UniTok/tok/entity_tok.py → UniTok-3.0.13/UniTok/tok/ent_tok.py +1 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py +22 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py +15 -2
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py +20 -8
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocabs.py +1 -1
- UniTok-3.0.13/UniTok.egg-info/PKG-INFO +142 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/SOURCES.txt +1 -1
- {UniTok-3.0.11 → UniTok-3.0.13}/setup.py +1 -1
- UniTok-3.0.11/PKG-INFO +0 -208
- UniTok-3.0.11/README.md +0 -194
- UniTok-3.0.11/UniTok.egg-info/PKG-INFO +0 -208
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/__init__.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/__init__.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/lengths.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/analysis/plot.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/cols.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/global_setting.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/bert_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/id_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/number_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/seq_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/split_tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/tok/tok.py +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/dependency_links.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/requires.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/UniTok.egg-info/top_level.txt +0 -0
- {UniTok-3.0.11 → UniTok-3.0.13}/setup.cfg +0 -0
UniTok-3.0.13/PKG-INFO
ADDED
@@ -0,0 +1,142 @@

Metadata-Version: 2.1
Name: UniTok
Version: 3.0.13
Summary: Unified Tokenizer
Home-page: https://github.com/Jyonn/UnifiedTokenizer
Author: Jyonn Liu
Author-email: i@6-79.cn
License: MIT Licence
Keywords: token,tokenizer,bert
Platform: any
Description-Content-Type: text/markdown

# UniTok V3.0

## Introduction

UniTok is a unified text-data preprocessing tool for machine learning. It provides a set of predefined tokenizers for handling different kinds of textual data. UniTok is easy to pick up and greatly reduces the effort of data preprocessing, letting algorithm engineers focus on the algorithms themselves.

UniDep is the parsing tool for data preprocessed by UniTok, and it can be used together with PyTorch's Dataset class.

## Installation

`pip install unitok>=3.0.11`

## Usage

We take a news recommendation scenario as an example. The dataset may contain the following parts:

- News content data `(news.tsv)`: each line is one news article, with features such as news ID, title, abstract, category, and subcategory, separated by `\t`.
- User history data `(user.tsv)`: each line is one user, containing the user ID and the list of news IDs the user has clicked, with news IDs separated by ` `.
- Interaction data: training `(train.tsv)`, validation `(dev.tsv)`, and test `(test.tsv)` data. Each line is one interaction record, containing the user ID, the news ID, and whether the user clicked, separated by `\t`.

We first analyze the data type of each attribute:

| File      | Attribute | Type | Example                                                               | Notes                                                |
|-----------|-----------|------|-----------------------------------------------------------------------|------------------------------------------------------|
| news.tsv  | nid       | str  | N1234                                                                 | News ID, unique identifier                           |
| news.tsv  | title     | str  | After 10 years, the iPhone is still the best smartphone in the world  | News title, usually tokenized with BertTokenizer     |
| news.tsv  | abstract  | str  | The iPhone 11 Pro is the best smartphone you can buy right now.       | News abstract, usually tokenized with BertTokenizer  |
| news.tsv  | category  | str  | Technology                                                            | News category, not splittable                        |
| news.tsv  | subcat    | str  | Mobile                                                                | News subcategory, not splittable                     |
| user.tsv  | uid       | str  | U1234                                                                 | User ID, unique identifier                           |
| user.tsv  | history   | str  | N1234 N1235 N1236                                                     | User history, separated by ` `                       |
| train.tsv | uid       | str  | U1234                                                                 | User ID, consistent with `user.tsv`                  |
| train.tsv | nid       | str  | N1234                                                                 | News ID, consistent with `news.tsv`                  |
| train.tsv | label     | int  | 1                                                                     | Click label: 0 = not clicked, 1 = clicked            |

We can then group these attributes as follows:

| Attribute        | Type | Preset tokenizer | Notes                                    |
|------------------|------|------------------|------------------------------------------|
| nid, uid, index  | str  | IdTok            | Unique identifier                        |
| title, abstract  | str  | BertTok          | Pass `vocab_dir="bert-base-uncased"`     |
| category, subcat | str  | EntTok           | Not splittable                           |
| history          | str  | SplitTok         | Pass `sep=' '`                           |
| label            | int  | NumberTok        | Pass `vocab_size=2`; only 0 and 1 occur  |

With the following code, we construct one UniTok object per file:

```python
from UniTok import UniTok, Column, Vocab
from UniTok.tok import IdTok, BertTok, EntTok, SplitTok, NumberTok

nid_vocab = Vocab('nid')  # shared by the news, history, and interaction data
eng_tok = BertTok(vocab_dir='bert-base-uncased', name='eng')  # for English text

news_ut = UniTok().add_col(Column(
    name='nid',  # column name; can be omitted if it matches the tok's name
    tok=IdTok(vocab=nid_vocab),  # every UniTok object must have exactly one IdTok
)).add_col(Column(
    name='title',
    tok=eng_tok,
    max_length=20,  # maximum length; longer sequences are truncated
)).add_col(Column(
    name='abstract',
    tok=eng_tok,  # the abstract shares the title's tokenizer
    max_length=30,  # maximum length; longer sequences are truncated
)).add_col(Column(
    name='category',
    tok=EntTok(name='cat'),  # without an explicit Vocab, one is created automatically from the name
)).add_col(Column(
    name='subcat',
    tok=EntTok(name='subcat'),
))

news_ut.read('news.tsv', sep='\t')  # read the data file
news_ut.tokenize()  # run tokenization
news_ut.store('data/news')  # store the tokenized results

uid_vocab = Vocab('uid')  # shared by the user data and the interaction data

user_ut = UniTok().add_col(Column(
    name='uid',
    tok=IdTok(vocab=uid_vocab),
)).add_col(Column(
    name='history',
    tok=SplitTok(sep=' '),  # news IDs in the history are space-separated
))

user_ut.read('user.tsv', sep='\t')  # read the data file
user_ut.tokenize()  # run tokenization
user_ut.store('data/user')  # store the tokenized results


def inter_tokenize(mode):
    # Since train/dev/test have different indices, a fresh UniTok object must be built
    # before each run. Otherwise the index vocabulary may be inaccurate, making the
    # metadata inconsistent with the real data.
    # Parsing the data with UniDep afterwards corrects the index offset.

    inter_ut = UniTok().add_index_col(
        # the index column of the interaction data is generated automatically; no tokenizer is needed
    ).add_col(Column(
        name='uid',
        tok=IdTok(vocab=uid_vocab),  # consistent with the uid column in user_ut
    )).add_col(Column(
        name='nid',
        tok=IdTok(vocab=nid_vocab),  # consistent with the nid column in news_ut
    )).add_col(Column(
        name='label',
        tok=NumberTok(vocab_size=2),  # only 0 and 1; supported in versions >= 3.0.11
    ))

    inter_ut.read(f'{mode}.tsv', sep='\t')  # read the data file
    inter_ut.tokenize()  # run tokenization
    inter_ut.store(mode)  # store the tokenized results


inter_tokenize('data/train')
inter_tokenize('data/dev')
inter_tokenize('data/test')
```

We can then parse the data with UniDep:

```python
from UniTok import UniDep

news_dep = UniDep('data/news')  # load the tokenized results
print(len(news_dep))
print(news_dep[0])
```
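The introduction above notes that UniDep can be combined with PyTorch's `Dataset` class. As an editorial illustration only (not part of the package), a minimal wrapper could look like the following sketch; it assumes nothing beyond the `len()` and integer-indexing behaviour demonstrated above, and the class name `NewsDataset` and the `data/news` path are placeholders.

```python
# A minimal sketch of pairing UniDep with torch.utils.data.Dataset.
# Not part of UniTok itself; relies only on len(dep) and dep[i] as shown above.
from torch.utils.data import Dataset

from UniTok import UniDep


class NewsDataset(Dataset):  # hypothetical wrapper class
    def __init__(self, store_dir: str):
        self.dep = UniDep(store_dir)  # load the tokenized results

    def __len__(self):
        return len(self.dep)

    def __getitem__(self, index):
        return self.dep[index]  # one tokenized sample (a dict of columns)


# Usage (assuming the news data was stored as in the example above):
# dataset = NewsDataset('data/news')
# print(dataset[0])
```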
UniTok-3.0.13/README.md
ADDED
@@ -0,0 +1,128 @@

The added README.md is a verbatim copy of the `# UniTok V3.0` description embedded in UniTok-3.0.13/PKG-INFO above (the same 128 lines without the metadata header), so its content is not repeated here.
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/column.py

@@ -42,9 +42,9 @@ class Column:
         tok (BaseTok): The tokenizer of the column.
         operator (SeqOperator): The operator of the column.
     """
-    def __init__(self,
-        self.name = name
+    def __init__(self, tok: BaseTok, name=None, operator: SeqOperator = None, **kwargs):
         self.tok = tok
+        self.name = name or tok.vocab.name
         self.operator = operator

         if kwargs:

@@ -115,4 +115,4 @@ class Column:

 class IndexColumn(Column):
     def __init__(self, name='index'):
-        super().__init__(
+        super().__init__(tok=IdTok(name=name))
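The constructor change above makes `name` optional, defaulting to the tokenizer's vocab name. A small illustrative sketch (editor's example, not from the package) of the shortened form:

```python
# Sketch of the new optional `name` argument on Column (3.0.13).
from UniTok import UniTok, Column
from UniTok.tok import EntTok

ut = UniTok().add_index_col().add_col(Column(
    tok=EntTok(name='category'),  # Column name defaults to the tok's vocab name, i.e. 'category'
))
# Equivalent to: Column(name='category', tok=EntTok(name='category'))
```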
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/meta.py

@@ -127,3 +127,15 @@ class Meta:
         print('Old meta data backed up to {}.'.format(self.path + '.bak'))
         self.save()
         print('Meta data upgraded.')
+
+    @property
+    def col_info(self):
+        warnings.warn('col_info is deprecated, use cols instead.'
+                      '(meta.col_info -> meta.cols)', DeprecationWarning)
+        return self.cols
+
+    @property
+    def vocab_info(self):
+        warnings.warn('vocab_info is deprecated, use vocs instead.'
+                      '(meta.vocab_info -> meta.vocs)', DeprecationWarning)
+        return self.vocs
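For downstream code, the deprecated properties keep old call sites working while pointing at the new names. A short migration sketch; note that the `dep.meta` access path is an assumption made for illustration and is not shown in this diff:

```python
# Migration sketch for the renamed Meta attributes (3.0.13).
from UniTok import UniDep

dep = UniDep('data/news')    # any stored UniTok output
cols = dep.meta.cols         # preferred name
cols = dep.meta.col_info     # still works, but emits DeprecationWarning
vocs = dep.meta.vocs         # preferred name
vocs = dep.meta.vocab_info   # still works, but emits DeprecationWarning
```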
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unidep.py

@@ -104,7 +104,28 @@ class UniDep:
         return self.sample_size

     def __str__(self):
-
+        """ UniDep (dir):
+
+        Sample Size: 1000
+        Id Column: id
+        Columns:
+            id, vocab index (size 1000)
+            text, vocab eng (size 30522), max length 100
+            label, vocab label (size 2)
+        """
+        introduction = f"""
+        UniDep ({self.meta.parse_version(self.meta.version)}): {self.store_dir}
+
+        Sample Size: {self.sample_size}
+        Id Column: {self.id_col}
+        Columns:\n"""
+
+        for col_name, col in self.cols.items():  # type: str, Col
+            introduction += f' \t{col_name}, vocab {col.voc.name} (size {col.voc.size})'
+            if col.max_length:
+                introduction += f', max length {col.max_length}'
+            introduction += '\n'
+        return introduction

     def __repr__(self):
         return str(self)
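With the new `__str__` (and the existing `__repr__` delegating to it), printing a loaded dataset yields the summary shown in the docstring above, for example:

```python
# Quick self-description of a stored dataset; output format per the docstring above.
from UniTok import UniDep

print(UniDep('data/news'))  # version, store dir, sample size, id column, per-column vocab sizes
```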
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/unitok.py

@@ -9,7 +9,7 @@ import pandas as pd
 from .cols import Cols
 from .column import Column, IndexColumn
 from .tok.bert_tok import BertTok
-from .tok.
+from .tok.ent_tok import EntTok
 from .tok.id_tok import IdTok
 from .vocab import Vocab
 from .vocabs import Vocabs

@@ -97,7 +97,7 @@ class UniTok:
         self.id_col = col
         return self

-    def
+    def read(self, df, sep=None):
         """
         Read data from a file
         """

@@ -107,6 +107,11 @@ class UniTok:
         self.data = df
         return self

+    def read_file(self, df, sep=None):
+        warnings.warn('read_file is deprecated, use read instead '
+                      '(will be removed in 4.x version)', DeprecationWarning)
+        return self.read(df, sep)
+
     def __getitem__(self, col):
         """
         Get the data of a column

@@ -166,6 +171,14 @@ class UniTok:
         return self.cols[col_name].tok.vocab.get_store_path(store_dir)

     def store_data(self, store_dir):
+        """
+        Store the tokenized data
+        """
+        warnings.warn('unitok.store_data is deprecated, use store instead '
+                      '(will be removed in 4.x version)', DeprecationWarning)
+        self.store(store_dir)
+
+    def store(self, store_dir):
         """
         Store the tokenized data
         """
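In user code only the method names change; a minimal before/after sketch (file paths are placeholders taken from the README example):

```python
# Migration sketch for the renamed UniTok entry points (3.0.13).
# The old names remain as deprecated wrappers until 4.x.
from UniTok import UniTok, Column
from UniTok.tok import EntTok

ut = UniTok().add_index_col().add_col(Column(name='category', tok=EntTok(name='cat')))

ut.read('news.tsv', sep='\t')         # preferred since 3.0.13
# ut.read_file('news.tsv', sep='\t')  # deprecated alias, emits DeprecationWarning

ut.tokenize()

ut.store('data/news')                 # preferred since 3.0.13
# ut.store_data('data/news')          # deprecated alias, emits DeprecationWarning
```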
{UniTok-3.0.11 → UniTok-3.0.13}/UniTok/vocab.py

@@ -94,7 +94,7 @@ class Vocab:
         """
         set first n tokens as reserved tokens
         """
-        if self
+        if len(self):
             raise ValueError(f'vocab {self.name} is not empty, can not reserve tokens')

         self.reserved_tokens = tokens

@@ -107,17 +107,25 @@ class Vocab:
         return self

     def get_tokens(self):
-
+        warnings.warn('vocab.get_tokens is deprecated, '
+                      'use list(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
+        return list(self)

     def get_size(self):
-
+        warnings.warn('vocab.get_size is deprecated, '
+                      'use len(vocab) instead (will be removed in 4.x version)', DeprecationWarning)
+        return len(self)

     def __len__(self):
-        return self.
+        return len(self.i2o)

     def __bool__(self):
         return True

+    def __iter__(self):
+        for i in range(len(self)):
+            yield self.i2o[i]
+
     """
     Editable Methods
     """

@@ -163,8 +171,8 @@ class Vocab:
     def save(self, store_dir):
         store_path = self.get_store_path(store_dir)
         with open(store_path, 'w') as f:
-            for
-            f.write('{}\n'.format(
+            for token in self:
+                f.write('{}\n'.format(token))

         return self

@@ -184,8 +192,10 @@ class Vocab:
     def trim(self, min_count=None, min_frequency=1):
         """
         trim vocab by min frequency
-        :return:
+        :return: trimmed tokens
         """
+        _trimmed = []
+
         if min_count is None:
             warnings.warn('vocab.min_frequency is deprecated, '
                           'use vocab.min_count instead (will be removed in 4.x version)', DeprecationWarning)

@@ -195,6 +205,8 @@ class Vocab:
         for index in self._counter:
             if self._counter[index] >= min_count:
                 vocabs.append(self.i2o[index])
+            else:
+                _trimmed.append(self.i2o[index])

         self.i2o = dict()
         self.o2i = dict()

@@ -205,7 +217,7 @@ class Vocab:
         self.extend(vocabs)

         self._stable_mode = True
-        return
+        return _trimmed

     def summarize(self, base=10):
         """
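The new `__len__`/`__iter__` protocol replaces the deprecated getters, and `trim` now reports what it dropped. An illustrative sketch; the `extend` call is assumed to be public because `trim` uses it internally, and the token values are placeholders:

```python
# Sketch of the updated Vocab protocol (3.0.13).
from UniTok import Vocab

voc = Vocab('cat')
voc.extend(['news', 'sports', 'finance'])  # assumed public; trim() uses it internally

print(len(voc))    # preferred over the deprecated voc.get_size()
print(list(voc))   # preferred over the deprecated voc.get_tokens()

for token in voc:  # __iter__ yields tokens in index order
    print(token)

# trim() now returns the tokens it removed instead of None:
# dropped = voc.trim(min_count=5)
```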
UniTok-3.0.13/UniTok.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,142 @@

The added egg-info copy is identical to UniTok-3.0.13/PKG-INFO above (142 lines), so its content is not repeated here.
UniTok-3.0.11/PKG-INFO
DELETED
@@ -1,208 +0,0 @@

Metadata-Version: 2.1
Name: UniTok
Version: 3.0.11
Summary: Unified Tokenizer
Home-page: https://github.com/Jyonn/UnifiedTokenizer
Author: Jyonn Liu
Author-email: i@6-79.cn
License: MIT Licence
Keywords: token,tokenizer,bert
Platform: any
Description-Content-Type: text/markdown

# Unified Tokenizer

> Instructions for the **3.0** version will be updated soon!

## Introduction

Unified Tokenizer, shortly **UniTok**, offers various pre-defined tokenizers for dealing with textual data. It is a central data processing tool that allows algorithm engineers to focus on the algorithm itself instead of tedious data preprocessing.

It incorporates the BERT tokenizer from the [transformers](https://github.com/huggingface/transformers) library, and supports customization via the general word segmentation module (i.e., the `BaseTok` class).

## Installation

`pip install UnifiedTokenizer`

## Usage

We use the head of the training set of the [MINDlarge](https://msnews.github.io/) dataset as an example (see the `news-sample.tsv` file).

### Data Declaration (more info: [MIND GitHub](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md))

Each line in the file is a piece of news with the following features, divided by the tab (`\t`) symbol:

- News ID
- Category
- SubCategory
- Title
- Abstract
- URL
- Title Entities (entities contained in the title of this news)
- Abstract Entities (entities contained in the abstract of this news)

We only use the first 5 columns for demonstration.

### Pre-defined Tokenizers

| Tokenizer | Description                                                           | Parameters  |
|-----------|-----------------------------------------------------------------------|-------------|
| BertTok   | Provided by the `transformers` library, using the WordPiece strategy  | `vocab_dir` |
| EntTok    | The column data is regarded as an entire token                        | /           |
| IdTok     | A specific version of EntTok, required to be identical                | /           |
| SplitTok  | Tokens are joined by separators like tab or space                     | `sep`       |

### Imports

```python
import pandas as pd

from UniTok import UniTok, Column
from UniTok.tok import IdTok, EntTok, BertTok
```

### Read data

```python
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='path/news-sample.tsv',
    sep='\t',
    names=['nid', 'cat', 'subCat', 'title', 'abs', 'url', 'titEnt', 'absEnt'],
    usecols=['nid', 'cat', 'subCat', 'title', 'abs'],
)
```

### Construct UniTok

```python
cat_tok = EntTok(name='cat')  # one tokenizer for both cat and subCat
text_tok = BertTok(name='english', vocab_dir='bert-base-uncased')  # specify the bert vocab

unitok = UniTok().add_index_col(
    name='nid'
).add_col(Column(
    name='cat',
    tok=cat_tok
)).add_col(Column(
    name='subCat',
    tok=cat_tok,
)).add_col(Column(
    name='title',
    tok=text_tok,
)).add_col(Column(
    name='abs',
    tok=text_tok,
)).read_file(df)
```

### Analyse Data

```python
unitok.analyse()
```

It shows the distribution of the length of each column (if using _ListTokenizer_), which helps us determine the _max_length_ of the tokens for each column.

```
[ COLUMNS ]
[ COL: nid ]
[NOT ListTokenizer]

[ COL: cat ]
[NOT ListTokenizer]

[ COL: subCat ]
[NOT ListTokenizer]

[ COL: title ]
[ MIN: 6 ]
[ MAX: 16 ]
[ AVG: 12 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
(ASCII histogram of title lengths omitted)

[ COL: abs ]
100%|██████████| 10/10 [00:00<00:00, 119156.36it/s]
100%|██████████| 10/10 [00:00<00:00, 166440.63it/s]
100%|██████████| 10/10 [00:00<00:00, 164482.51it/s]
100%|██████████| 10/10 [00:00<00:00, 2172.09it/s]
100%|██████████| 10/10 [00:00<00:00, 1552.30it/s]
[ MIN: 0 ]
[ MAX: 46 ]
[ AVG: 21 ]
[ X-INT: 1 ]
[ Y-INT: 0 ]
(ASCII histogram of abstract lengths omitted)

[ VOCABS ]
[ VOC: news with 10 tokens ]
[ COL: nid ]

[ VOC: cat with 112 tokens ]
[ COL: cat, subCat ]

[ VOC: english with 30522 tokens ]
[ COL: title, abs ]
```

### ReConstruct Unified Tokenizer

```python
unitok = UniTok().add_index_col(
    name='nid'
).add_col(Column(
    name='cat',
    tok=cat_tok.as_sing()
)).add_col(Column(
    name='subCat',
    tok=cat_tok.as_sing(),
)).add_col(Column(
    name='title',
    tok=text_tok,
    operator=SeqOperator(max_length=20),
)).add_col(Column(
    name='abs',
    tok=text_tok,
    operator=SeqOperator(max_length=30),
)).read_file(df)
```

In this step, we set the _max_length_ of each column. If _max_length_ is not set, the **whole** sequence is kept without truncation.

### Tokenize and Store

```python
unitok.tokenize()
unitok.store_data('TokenizedData')
```
UniTok-3.0.11/README.md
DELETED
@@ -1,194 +0,0 @@

The removed README.md was a verbatim copy of the description embedded in UniTok-3.0.11/PKG-INFO above (the same 194 lines without the metadata header), so its content is not repeated here.
UniTok-3.0.11/UniTok.egg-info/PKG-INFO
DELETED
@@ -1,208 +0,0 @@

The removed egg-info copy was identical to UniTok-3.0.11/PKG-INFO above (208 lines), so its content is not repeated here.
The remaining 16 files listed above with +0 -0 (from {UniTok-3.0.11 → UniTok-3.0.13}/UniTok/__init__.py through setup.cfg) are unchanged between the two versions.