tokenizerchanger-0.0.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.1
2
+ Name: TokenizerChanger
3
+ Version: 0.0.1
4
+ Summary: Library for manipulating the existing tokenizer.
5
+ Home-page: https://github.com/1kkiRen/Tokenizer-Changer
6
+ Author: 1kkiren
7
+ Author-email: 1kkiren@mail.ru
8
+ Project-URL: GitHub, https://github.com/1kkiRen/Tokenizer-Changer
9
+ Keywords: tokenizer deletion tokens
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: License :: OSI Approved :: Apache Software License
12
+ Classifier: Operating System :: OS Independent
13
+ Requires-Python: >=3.9
14
+ Description-Content-Type: text/markdown
15
+
16
+ # Tokens-Deletion
17
+ Python script for manipulating the existing tokenizer.
18
+
19
+ The solution was tested on Llama3-8B tokenizer.
20
+
21
+ -----
22
+ # Usage:
23
+
24
+ ```python
25
+ changer = TokenizerChanger(tokenizer)
26
+ ```
27
+ Create a `TokenizerChanger` object from an existing tokenizer that you want to modify, e.g. a `PreTrainedTokenizerFast` from the 🤗 Tokenizers library.
28
+
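+ A minimal setup sketch for the snippet above, assuming a fast BPE tokenizer; the model name is only an illustrative assumption:
+ ```python
+ # Load a fast tokenizer with transformers and wrap it in a TokenizerChanger.
+ from transformers import AutoTokenizer
+ from TokenizerChanger import TokenizerChanger
+
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
+ changer = TokenizerChanger(tokenizer)
+ ```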
29
+ ## Deletion:
30
+ ```python
31
+ changer.delete_k_least_frequent_tokens(k=1000)
32
+ changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
33
+ ```
34
+ Deletes the k least frequent tokens. Tokens listed in the `exclude` argument are ignored during the deletion.
35
+
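+ For example, a small sketch that keeps the tokenizer's special tokens out of the pruning, using the standard `all_special_tokens` attribute from `transformers`:
+ ```python
+ # Delete the 1000 least frequent tokens, but never the special tokens.
+ changer.delete_k_least_frequent_tokens(
+     k=1000,
+     exclude=tokenizer.all_special_tokens,
+ )
+ ```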
36
+ ```python
37
+ changer.delete_unwanted_tokens(list_of_unwanted_tokens)
38
+ ```
39
+ Deletes all tokens from `list_of_unwanted_tokens` from the tokenizer.
40
+
41
+ ```python
42
+ changer.delete_tokens(list_of_unwanted_tokens)
43
+ ```
44
+ Deletes exactly the tokens in the list. In contrast, the `delete_unwanted_tokens` function deletes both the listed tokens and any tokens that contain them as a substring.
45
+
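+ A quick sketch of the difference; the token strings are illustrative assumptions:
+ ```python
+ # Two alternatives: delete_unwanted_tokens also matches substrings,
+ # while delete_tokens removes only exact vocabulary entries.
+ changer.delete_unwanted_tokens(["token"])  # removes "token", "tokens", "Ġtokenizer", ...
+ # changer.delete_tokens(["token"])         # would remove only the exact entry "token"
+ ```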
46
+ ```python
47
+ changer.delete_overlaps(vocab)
48
+ ```
49
+ Finds all tokens shared by the `tokenizer`'s vocabulary and `vocab`, and deletes them from the `tokenizer`. Notice that `vocab` should be a `dict` variable.
50
+
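+ For instance, a sketch that strips every token also present in another tokenizer's vocabulary; the second model name is an assumption, and `get_vocab()` returns the required `dict`:
+ ```python
+ from transformers import AutoTokenizer
+
+ other = AutoTokenizer.from_pretrained("gpt2")
+ changer.delete_overlaps(other.get_vocab())  # removes the shared tokens and their merges
+ ```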
51
+ ```python
52
+ changer.delete_inappropriate_merges(vocab)
53
+ ```
54
+ Deletes all merges from the `tokenizer` that are inconsistent with the `vocab` variable, i.e. merges whose parts or resulting token are not present in `vocab`. Notice that `vocab` should be a `list[str]` variable.
55
+
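+ One possible use, relying on the changer's `model_state` attribute from the source, is to prune merges that refer to tokens no longer in the (already pruned) vocabulary:
+ ```python
+ # Keep only merges whose parts and result still exist in the current vocabulary.
+ remaining_vocab = list(changer.model_state["vocab"].keys())
+ changer.delete_inappropriate_merges(remaining_vocab)
+ ```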
56
+
57
+ ## Addition:
58
+ These functions exist because the built-in functions do not add tokens/merges correctly once some tokens have been deleted: encoding the same text can then produce more tokens, even though the necessary tokens have been added back.
59
+
60
+ ```python
61
+ changer.add_tokens(list_of_tokens)
62
+ ```
63
+ Adds the tokens from the list. The token IDs are assigned automatically.
64
+
65
+ ```python
66
+ changer.add_merges(list_of_merges)
67
+ ```
68
+ Adds the merges from the list.
69
+
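+ A combined sketch of the two snippets above; the `Ġ` space marker and the `"left right"` merge string format are assumptions based on byte-level BPE conventions:
+ ```python
+ # Re-add a token together with the merge that produces it.
+ changer.add_tokens(["Ġhello"])
+ changer.add_merges(["Ġhe llo"])
+ ```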
70
+
71
+ ## "Get" functions:
72
+ ```python
73
+ changer.get_overlapping_tokens(vocab)
74
+ ```
75
+ Returns the intersection between the `tokenizer`'s vocabulary and the `vocab` variable. Notice that `vocab` should be a `dict` variable.
76
+
77
+ ```python
78
+ changer.get_overlapping_megres(merges)
79
+ ```
80
+ Returns the intersection between the `tokenizer`'s merges and the `merges` variable. Notice that `merges` should be a `list` variable.
81
+
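+ A sketch that inspects overlaps with another tokenizer before deleting anything; the model name is an assumption, and reading the merges via `backend_tokenizer` mirrors how the library itself loads the model state:
+ ```python
+ import json
+ from transformers import AutoTokenizer
+
+ other = AutoTokenizer.from_pretrained("gpt2")
+ shared_tokens = changer.get_overlapping_tokens(other.get_vocab())
+ other_state = json.loads(other.backend_tokenizer.model.__getstate__())
+ shared_merges = changer.get_overlapping_megres(other_state["merges"])
+ print(len(shared_tokens), len(shared_merges))
+ ```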
82
+
83
+ ## Saving:
84
+ ```python
85
+ changer.save_tokenizer(path)
86
+ ```
87
+ Saves the current state of the changed tokenizer, including its configuration files, into the `path` folder (`./updated_tokenizer` by default).
88
+
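+ For a quick end-to-end check, a sketch that saves the modified tokenizer and reloads it with `transformers`:
+ ```python
+ from transformers import AutoTokenizer
+
+ changer.save_tokenizer("./updated_tokenizer")
+ reloaded = AutoTokenizer.from_pretrained("./updated_tokenizer")
+ print(len(reloaded))  # vocabulary size after the changes
+ ```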
89
+ ```python
90
+ tokenizer = changer.updated_tokenizer()
91
+ ```
92
+ Returns the changed tokenizer.
@@ -0,0 +1,77 @@
1
+ # Tokens-Deletion
2
+ Python script for manipulating the existing tokenizer.
3
+
4
+ The solution was tested on Llama3-8B tokenizer.
5
+
6
+ -----
7
+ # Usage:
8
+
9
+ ```python
10
+ changer = TokenizerChanger(tokenizer)
11
+ ```
12
+ Create a `TokenizerChanger` object from an existing tokenizer that you want to modify, e.g. a `PreTrainedTokenizerFast` from the 🤗 Tokenizers library.
13
+
14
+ ## Deletion:
15
+ ```python
16
+ changer.delete_k_least_frequent_tokens(k=1000)
17
+ changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
18
+ ```
19
+ Deletes the k least frequent tokens. Tokens listed in the `exclude` argument are ignored during the deletion.
20
+
21
+ ```python
22
+ changer.delete_unwanted_tokens(list_of_unwanted_tokens)
23
+ ```
24
+ Deletes all tokens from `list_of_unwanted_tokens` from the tokenizer.
25
+
26
+ ```python
27
+ changer.delete_tokens(list_of_unwanted_tokens)
28
+ ```
29
+ Deletes exactly the tokens in the list. In contrast, the `delete_unwanted_tokens` function deletes both the listed tokens and any tokens that contain them as a substring.
30
+
31
+ ```python
32
+ changer.delete_overlaps(vocab)
33
+ ```
34
+ Finds all tokens shared by the `tokenizer`'s vocabulary and `vocab`, and deletes them from the `tokenizer`. Notice that `vocab` should be a `dict` variable.
35
+
36
+ ```python
37
+ changer.delete_inappropriate_merges(vocab)
38
+ ```
39
+ Deletes all merges from the `tokenizer` that are inconsistent with the `vocab` variable, i.e. merges whose parts or resulting token are not present in `vocab`. Notice that `vocab` should be a `list[str]` variable.
40
+
41
+
42
+ ## Addition:
43
+ These functions exist because the built-in functions do not add tokens/merges correctly once some tokens have been deleted: encoding the same text can then produce more tokens, even though the necessary tokens have been added back.
44
+
45
+ ```python
46
+ changer.add_tokens(list_of_tokens)
47
+ ```
48
+ Adds the tokens from the list. The token IDs are assigned automatically.
49
+
50
+ ```python
51
+ changer.add_merges(list_of_merges)
52
+ ```
53
+ Adds the merges from the list.
54
+
55
+
56
+ ## "Get" functions:
57
+ ```python
58
+ changer.get_overlapping_tokens(vocab)
59
+ ```
60
+ Returns the intersection between the `tokenizer`'s vocabulary and the `vocab` variable. Notice that `vocab` should be a `dict` variable.
61
+
62
+ ```python
63
+ changer.get_overlapping_megres(merges)
64
+ ```
65
+ Returns the intersection between the `tokenizer`'s merges and the `merges` variable. Notice that `merges` should be a `list` variable.
66
+
67
+
68
+ ## Saving:
69
+ ```python
70
+ changer.save_tokenizer(path)
71
+ ```
72
+ Saves the current state of the changed tokenizer, including its configuration files, into the `path` folder (`./updated_tokenizer` by default).
73
+
74
+ ```python
75
+ tokenizer = changer.updated_tokenizer()
76
+ ```
77
+ Returns the changed tokenizer.
@@ -0,0 +1,7 @@
1
+ """
2
+ TokenizerChanger library v0.0.1
3
+
4
+ The Apache 2.0 License Copyright © Dmitrii Kuzmin
5
+ """
6
+
7
+ from .tokenizer_changer import *
@@ -0,0 +1,159 @@
1
+ import json
2
+ from tqdm import tqdm
3
+ from tokenizers import models
4
+ from transformers import PreTrainedTokenizer
5
+
6
+
7
+ class TokenizerChanger:
8
+ def __init__(self, tokenizer: PreTrainedTokenizer):
9
+ self.tokenizer: PreTrainedTokenizer = tokenizer
10
+ self.unwanted_tokens = []
11
+ self.none_types = []
12
+ self.target_changes = 0
13
+ self.model_state = json.loads(
14
+ tokenizer.backend_tokenizer.model.__getstate__())
15
+
16
+ def delete_tokens(self, unwanted_tokens: list[str] = None):
17
+ self.unwanted_tokens = list(set(unwanted_tokens)) if unwanted_tokens else list(
18
+ set(self.unwanted_tokens))
19
+ for token in tqdm(self.unwanted_tokens, desc="Deleting unwanted words"):
20
+ del self.model_state["vocab"][token]
21
+
22
+ def find_least_tokens(self, k_least: int, exclude: list[str] = []):
23
+ self.unwanted_tokens = []
24
+ for k, v in tqdm(dict(reversed(list(self.model_state["vocab"].items()))).items(), desc="Finding unwanted tokens"):
25
+ if len(self.unwanted_tokens) >= k_least:
26
+ break
27
+ if k not in exclude:
28
+ self.unwanted_tokens.append(k)
29
+
30
+ def find_tokens(self, unwanted_tokens: list[str]):
31
+ for token in self.model_state["vocab"]:
32
+ for unwanted_token in unwanted_tokens:
33
+ if unwanted_token in token:
34
+ self.unwanted_tokens.append(token)
35
+
36
+ def delete_merges(self, unwanted_tokens: list[str] = None):
37
+ processed_merges = [(''.join(merge).replace(' ', ''), merge)
38
+ for merge in self.model_state["merges"]]
39
+
40
+ unwanted_merges_set = set()
41
+
42
+ self.unwanted_tokens = list(set(unwanted_tokens)) if unwanted_tokens else list(
43
+ set(self.unwanted_tokens))
44
+
45
+ for processed_merge, original_merge in tqdm(processed_merges, desc="Finding unwanted merges"):
46
+ if any(token in processed_merge for token in self.unwanted_tokens):
47
+ unwanted_merges_set.add(original_merge)
48
+
49
+ self.model_state["merges"] = [merge for merge in tqdm(
50
+ self.model_state["merges"], desc="Deleting unwanted merges") if merge not in unwanted_merges_set]
51
+
52
+ def find_token_id_gap(self):
53
+ reversed_vocab_values = list(
54
+ reversed(self.model_state['vocab'].values()))
55
+ last_gap = 0
56
+ for i in range(len(self.model_state['vocab']) - 1):
57
+ if reversed_vocab_values[i] - reversed_vocab_values[i + 1] > 1:
58
+ last_gap = reversed_vocab_values[i + 1]
59
+
60
+ return last_gap
61
+
62
+ def add_tokens(self, tokens: list[str]):
63
+ i = 1
64
+ border_id = self.find_token_id_gap()
65
+ for token in tqdm(tokens, desc="Adding tokens"):
66
+ if token not in self.model_state["vocab"]:
67
+ while border_id + i in self.model_state['vocab'].values():
68
+ i += 1
69
+ self.model_state["vocab"][token] = border_id + i
70
+ i += 1
71
+
72
+ def add_merges(self, merges: list[str]):
73
+ for merge in tqdm(self.model_state["merges"], desc="Adding merges"):
74
+ merges.append(merge)
75
+
76
+ self.model_state["merges"] = list(set(merges))
77
+
78
+ def delete_inappropriate_merges(self, vocab: list[str]):
79
+ processed_merges = [(''.join(merge).replace(' ', ''), merge)
80
+ for merge in self.model_state["merges"]]
81
+
82
+ unwanted_merges_set = set()
83
+
84
+ for processed_merge, original_merge in tqdm(processed_merges, desc="Finding unwanted merges"):
85
+ if not all(token in vocab for token in [processed_merge, original_merge[0], original_merge[1]]):
86
+ unwanted_merges_set.add(original_merge)
87
+
88
+ self.model_state["merges"] = [merge for merge in tqdm(
89
+ self.model_state["merges"], desc="Deleting unwanted merges") if merge not in unwanted_merges_set]
90
+
91
+ def get_overlapping_tokens(self, vocab: dict):
92
+ overlapping_tokens = []
93
+ for token in tqdm(vocab.keys(), desc="Finding overlapping tokens"):
94
+ if token in self.model_state["vocab"].keys():
95
+ overlapping_tokens.append(token)
96
+ return overlapping_tokens
97
+
98
+ def get_overlapping_megres(self, merges: list):
99
+ overlapping_merges = []
100
+
101
+ processed_merges_new_tokenizer = [(''.join(merge).replace(' ', ''), merge)
102
+ for merge in self.model_state["merges"]]
103
+
104
+ processed_merges_old_tokenizer = [(''.join(merge).replace(' ', ''), merge)
105
+ for merge in merges]
106
+
107
+ for merge in tqdm(processed_merges_new_tokenizer, desc="Finding overlapping merges"):
108
+ if any(merge in processed_merge for processed_merge in processed_merges_old_tokenizer):
109
+ overlapping_merges.append(merge)
110
+
111
+ return overlapping_merges
112
+
113
+ def format_merges(self):
114
+ for i in tqdm(range(len(self.model_state["merges"])), desc="Formating merges"):
115
+ if type(self.model_state["merges"][i]) != tuple:
116
+ self.model_state["merges"][i] = tuple(
117
+ map(str, self.model_state["merges"][i].split()))
118
+
119
+ def delete_none_types(self):
120
+ for k, v in self.model_state.items():
121
+ if v == None:
122
+ self.none_types.append(k)
123
+
124
+ for k in self.none_types:
125
+ del self.model_state[k]
126
+
127
+ def delete_k_least_frequent_tokens(self, k: int, exclude: list[str] = []):
128
+ self.find_least_tokens(k, exclude)
129
+ self.delete_tokens()
130
+ self.delete_merges()
131
+
132
+ def delete_unwanted_tokens(self, unwanted_tokens: list):
133
+ self.find_tokens(unwanted_tokens)
134
+ self.delete_tokens()
135
+ self.delete_merges()
136
+
137
+ def delete_overlaps(self, vocab: dict):
138
+ overlaps = list(set(self.get_overlapping_tokens(vocab)))
139
+ self.delete_tokens(unwanted_tokens=overlaps)
140
+ self.delete_merges()
141
+
142
+ def save_tokenizer(self, path: str = "updated_tokenizer"):
143
+ self.format_merges()
144
+ self.delete_none_types()
145
+
146
+ model_class = getattr(
147
+ models, self.model_state.pop("type")
148
+ )
149
+
150
+ self.tokenizer.backend_tokenizer.model = model_class(
151
+ **self.model_state)
152
+
153
+ self.model_state = json.loads(
154
+ self.tokenizer.backend_tokenizer.model.__getstate__())
155
+
156
+ self.tokenizer.save_pretrained(path)
157
+
158
+ def updated_tokenizer(self) -> PreTrainedTokenizer:
159
+ return self.tokenizer
@@ -0,0 +1,92 @@
1
+ Metadata-Version: 2.1
2
+ Name: TokenizerChanger
3
+ Version: 0.0.1
4
+ Summary: Library for manipulating the existing tokenizer.
5
+ Home-page: https://github.com/1kkiRen/Tokenizer-Changer
6
+ Author: 1kkiren
7
+ Author-email: 1kkiren@mail.ru
8
+ Project-URL: GitHub, https://github.com/1kkiRen/Tokenizer-Changer
9
+ Keywords: tokenizer deletion tokens
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: License :: OSI Approved :: Apache Software License
12
+ Classifier: Operating System :: OS Independent
13
+ Requires-Python: >=3.9
14
+ Description-Content-Type: text/markdown
15
+
16
+ # Tokens-Deletion
17
+ Python script for manipulating the existing tokenizer.
18
+
19
+ The solution was tested on Llama3-8B tokenizer.
20
+
21
+ -----
22
+ # Usage:
23
+
24
+ ```python
25
+ changer = TokenizerChanger(tokenizer)
26
+ ```
27
+ Create a `TokenizerChanger` object from an existing tokenizer that you want to modify, e.g. a `PreTrainedTokenizerFast` from the 🤗 Tokenizers library.
28
+
29
+ ## Deletion:
30
+ ```python
31
+ changer.delete_k_least_frequent_tokens(k=1000)
32
+ changer.delete_k_least_frequent_tokens(k=1000, exclude=list_of_tokens)
33
+ ```
34
+ Deletes the k least frequent tokens. Tokens listed in the `exclude` argument are ignored during the deletion.
35
+
36
+ ```python
37
+ changer.delete_unwanted_tokens(list_of_unwanted_tokens)
38
+ ```
39
+ Deletes all tokens from `list_of_unwanted_tokens` from the tokenizer.
40
+
41
+ ```python
42
+ changer.delete_tokens(list_of_unwanted_tokens)
43
+ ```
44
+ Deletes exactly the tokens in the list. In contrast, the `delete_unwanted_tokens` function deletes both the listed tokens and any tokens that contain them as a substring.
45
+
46
+ ```python
47
+ changer.delete_overlaps(vocab)
48
+ ```
49
+ Finds all tokens shared by the `tokenizer`'s vocabulary and `vocab`, and deletes them from the `tokenizer`. Notice that `vocab` should be a `dict` variable.
50
+
51
+ ```python
52
+ changer.delete_inappropriate_merges(vocab)
53
+ ```
54
+ Deletes all merges from the `tokenizer` that are inconsistent with the `vocab` variable, i.e. merges whose parts or resulting token are not present in `vocab`. Notice that `vocab` should be a `list[str]` variable.
55
+
56
+
57
+ ## Addition:
58
+ These functions exist because the built-in functions do not add tokens/merges correctly once some tokens have been deleted: encoding the same text can then produce more tokens, even though the necessary tokens have been added back.
59
+
60
+ ```python
61
+ changer.add_tokens(list_of_tokens)
62
+ ```
63
+ Adds the tokens from the list. The token IDs are assigned automatically.
64
+
65
+ ```python
66
+ changer.add_merges(list_of_merges)
67
+ ```
68
+ Adds the merges from the list.
69
+
70
+
71
+ ## "Get" functions:
72
+ ```python
73
+ changer.get_overlapping_tokens(vocab)
74
+ ```
75
+ Returns the intersection between the `tokenizer`'s vocabulary and the `vocab` variable. Notice that `vocab` should be a `dict` variable.
76
+
77
+ ```python
78
+ changer.get_overlapping_megres(merges)
79
+ ```
80
+ Returns the intersection between the `tokenizer`'s merges and the `merges` variable. Notice that `merges` should be a `list` variable.
81
+
82
+
83
+ ## Saving:
84
+ ```python
85
+ changer.save_tokenizer(path)
86
+ ```
87
+ Saves the current state of the changed tokenizer, including its configuration files, into the `path` folder (`./updated_tokenizer` by default).
88
+
89
+ ```python
90
+ tokenizer = changer.updated_tokenizer()
91
+ ```
92
+ Returns the changed tokenizer.
@@ -0,0 +1,10 @@
1
+ README.md
2
+ setup.cfg
3
+ setup.py
4
+ TokenizerChanger/__init__.py
5
+ TokenizerChanger/tokenizer_changer.py
6
+ TokenizerChanger.egg-info/PKG-INFO
7
+ TokenizerChanger.egg-info/SOURCES.txt
8
+ TokenizerChanger.egg-info/dependency_links.txt
9
+ TokenizerChanger.egg-info/requires.txt
10
+ TokenizerChanger.egg-info/top_level.txt
@@ -0,0 +1,3 @@
1
+ tokenizers>=0.19.1
2
+ tqdm>=4.66.4
3
+ transformers>=4.41.2
@@ -0,0 +1 @@
1
+ TokenizerChanger
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,34 @@
1
+ from setuptools import setup, find_packages
2
+
3
+
4
+ def readme():
5
+ with open('README.md', 'r') as f:
6
+ return f.read()
7
+
8
+
9
+ setup(
10
+ name='TokenizerChanger',
11
+ version='0.0.1',
12
+ author='1kkiren',
13
+ author_email='1kkiren@mail.ru',
14
+ description='Library for manipulating the existing tokenizer.',
15
+ long_description=readme(),
16
+ long_description_content_type='text/markdown',
17
+ url='https://github.com/1kkiRen/Tokenizer-Changer',
18
+ packages=find_packages(),
19
+ install_requires=[
20
+ 'tokenizers>=0.19.1',
21
+ 'tqdm>=4.66.4',
22
+ 'transformers>=4.41.2'
23
+ ],
24
+ classifiers=[
25
+ 'Programming Language :: Python :: 3.10',
26
+ 'License :: OSI Approved :: Apache Software License',
27
+ 'Operating System :: OS Independent'
28
+ ],
29
+ keywords='tokenizer deletion tokens ',
30
+ project_urls={
31
+ 'GitHub': 'https://github.com/1kkiRen/Tokenizer-Changer'
32
+ },
33
+ python_requires='>=3.9'
34
+ )