robo-lib 0.0.4__tar.gz → 0.0.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
robo_lib-0.0.5/PKG-INFO ADDED
@@ -0,0 +1,243 @@
+ Metadata-Version: 2.3
+ Name: robo_lib
+ Version: 0.0.5
+ Summary: A package to create, configure, and train transformer models.
+ Project-URL: Homepage, https://github.com/hamburgerfish/robo_pack
+ Project-URL: Issues, https://github.com/hamburgerfish/robo_pack/issues
+ Author-email: Erik Papp <erik3papp@gmail.com>
+ License-File: LICENSE
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Requires-Python: >=3.8
+ Requires-Dist: numpy
+ Requires-Dist: tokenizers
+ Requires-Dist: torch
+ Description-Content-Type: text/markdown
+
+ # robo-lib
+
+ robo-lib provides tools for creating, configuring, and training custom transformer models on any data available to you.
+
+ ## Main features
+ - Customize and train tokenizers using an implementation of the features from the [tokenizers](https://pypi.org/project/tokenizers/#description) library.
+ - Customize a data processor that turns raw data into individual tensors, ready to be used to train transformers without further processing.
+ - Configure transformer models to fit specific requirements without having to write the internal logic.
+ - Use the three components together to create, train, and use custom transformers in different applications.
+
+ ## Installation
+
+ ```bash
+ pip install robo-lib
+ ```
+
+ ## Using robo-lib
+
+ Documentation can be found [here](https://github.com/hamburgerfish/robo_pack/wiki).
+
+ ### Language translation example
+ - In this example, an encoder-decoder transformer is created for language translation from English to French.
+ - This example uses two .txt files for training: one containing English sentences and the other containing the equivalent French sentence on each line (delimited by "\n").
+ - Create, train, and save tokenizers using `TokenizerConstructor`.
+ - In this example, the WordLevel tokenizer is used; all other `TokenizerConstructor` arguments are left at their defaults.
+
+ ```python
+ import robo_lib as rl
+
+ encoder_tok = rl.TokenizerConstructor(tokenizer_type="WordLevel")
+ encoder_tok.train("english_data.txt")
+
+ decoder_tok = rl.TokenizerConstructor(tokenizer_type="WordLevel")
+ decoder_tok.train("french_data.txt")
+
+ rl.save_component(encoder_tok, "tokenizers/encoder_tok.pkl")
+ rl.save_component(decoder_tok, "tokenizers/decoder_tok.pkl")
+ ```
+
+ - The `DataProcessor` can be used to automatically process the data into a single torch.tensor, ready to be used by the transformer for training.
+ - The tokenizer(s) must be specified when initialising a `DataProcessor`. In this case, both dec_tokenizer and enc_tokenizer are specified for an encoder-decoder transformer.
+ - The `process_list` method processes lists of string data, so our .txt files are read into lists to be processed by `process_list`.
+ - In this example, we are splitting the data 90% : 10% between training and validation.
+
+ ```python
+ proc = rl.DataProcessor(dec_tokenizer=decoder_tok, enc_tokenizer=encoder_tok)
+
+ # read training .txt files into lists
+ with open("english_data.txt", "r") as file:
+     english_list = file.read().split("\n")
+
+ with open("french_data.txt", "r") as file:
+     french_list = file.read().split("\n")
+
+ # split lists into train and validation sets
+ split = 0.9
+ n = int(len(english_list) * split)
+ english_train = english_list[:n]
+ french_train = french_list[:n]
+ english_val = english_list[n:]
+ french_val = french_list[n:]
+
+ # process and save training data as data/training*.pt
+ # block_size_exceeded_policy="skip" removes training examples longer than the specified block size
+ proc.process_list(
+     save_path="data/training",
+     dec_data=french_train,
+     dec_max_block_size=100,
+     dec_block_size_exceeded_policy="skip",
+     enc_data=english_train,
+     enc_max_block_size=100,
+     enc_block_size_exceeded_policy="skip"
+ )
+
+ # process and save validation data as data/validation*.pt
+ proc.process_list(
+     save_path="data/validation",
+     dec_data=french_val,
+     dec_max_block_size=100,
+     dec_block_size_exceeded_policy="skip",
+     enc_data=english_val,
+     enc_max_block_size=100,
+     enc_block_size_exceeded_policy="skip"
+ )
+ ```
+ - The `RoboConstructor` class is used to create and configure transformer models before training.
+ - A separate .py file is recommended for training.
+ - If device is not specified, `RoboConstructor` takes the first available of ("cuda", "mps", "cpu"); a sketch of this fallback follows the code block below. The CUDA build of torch is not installed as a dependency of robo-lib, so if you have a CUDA-compatible device it is highly recommended to install it by following this [link](https://pytorch.org/get-started/locally/).
+ - The `train` method is used to train the transformer and save it to `save_path` every `eval_interval` iterations.
+ - If a tokenizer other than a `TokenizerConstructor` is used, the pad token of your tokenizer can be specified with the pad_token parameter instead of dec_tokenizer.
+
+ ```python
+ import robo_lib as rl
+
+ encoder_tok = rl.load_component("tokenizers/encoder_tok.pkl")
+ decoder_tok = rl.load_component("tokenizers/decoder_tok.pkl")
+
+ robo = rl.RoboConstructor(
+     n_embed=512,
+     dec_n_blocks=6,
+     dec_n_head=8,
+     dec_vocab_size=decoder_tok.vocab_size,
+     dec_block_size=100,
+     enc_n_blocks=6,
+     enc_n_head=8,
+     enc_vocab_size=encoder_tok.vocab_size,
+     enc_block_size=100
+ )
+
+ robo.train(
+     max_iters=20000,
+     eval_interval=200,
+     batch_size=128,
+     dec_training_path="data/training_decoder_data.pt",
+     dec_eval_path="data/validation_decoder_data.pt",
+     dec_training_masks_path="data/training_decoder_mask_data.pt",
+     dec_eval_masks_path="data/validation_decoder_mask_data.pt",
+     enc_training_path="data/training_encoder_data.pt",
+     enc_eval_path="data/validation_encoder_data.pt",
+     enc_training_masks_path="data/training_encoder_mask_data.pt",
+     enc_eval_masks_path="data/validation_encoder_mask_data.pt",
+     dec_tokenizer=decoder_tok,
+     save_path="models/eng_to_fr_robo.pkl"
+ )
+ ```
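A minimal sketch of that device fallback, assuming the standard torch availability checks (this mirrors the documented order, not robo-lib's actual internals):

```python
import torch

def first_available_device() -> str:
    # documented fallback order: "cuda", then "mps", then "cpu"
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# pass it explicitly, e.g. rl.RoboConstructor(..., device=first_available_device()),
# or omit device and let RoboConstructor pick the same way.
```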
+
+ - For language translation, a loss of around 3 already shows good results.
+ - To use the trained transformer, the `generate` method can be employed.
+ - The temperature, top_k, and top_p values can be specified for this method, along with the tokenizers used.
+ - If a tokenizer other than a `TokenizerConstructor` is used, the start, end, separator (decoder-only), and new-line tokens of your tokenizer can be specified; see the sketch after the code block below.
+ - In this example, a simple script interacts with the user on the command line: the user's English input is translated by the transformer and printed to the console in French.
+
+ ```python
+ import robo_lib as rl
+
+ robo = rl.load_component("models/eng_to_fr_robo.pkl")
+ encoder_tok = rl.load_component("tokenizers/encoder_tok.pkl")
+ decoder_tok = rl.load_component("tokenizers/decoder_tok.pkl")
+
+ while True:
+     query = input()
+     print(robo.generate(query, dec_tokenizer=decoder_tok, enc_tokenizer=encoder_tok))
+ ```
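As a rough illustration of these options, the sketch below uses parameter names taken from the `generate` signature shown later in this diff; the token IDs are hypothetical placeholders for a tokenizer of your own:

```python
print(robo.generate(
    query,
    dec_tokenizer=decoder_tok,
    enc_tokenizer=encoder_tok,
    dec_start_token=0,   # hypothetical [START] id of your tokenizer
    dec_end_token=1,     # hypothetical [END] id
    temperature=0.8,     # below 1 makes the output more literal
    top_k=50             # sample only from the 50 most likely tokens
))
```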
+
+ ### Shakespeare dialogue generator example
+ - In this example, a decoder-only transformer is created and trained on a file containing all the dialogue written by William Shakespeare in his plays.
+ - The training data is in the form of a single .txt file containing the dialogue.
+ - The default BPE tokenizer is used in this case, so no argument is specified for `TokenizerConstructor`.
+
+ ```python
+ import robo_lib as rl
+
+ tok = rl.TokenizerConstructor()
+ tok.train("shakespeare_dialogues.txt")
+
+ rl.save_component(tok, "tokenizers/shakespeare_tok.pkl")
+ ```
+
+ - In this example, instead of having multiple pieces of training data, we have one large text file, from which random chunks of length `block_size` can be sampled during training. Therefore, a single large string is passed to the DataProcessor instead of a list of strings.
+ - Since this is a decoder-only transformer, encoder arguments are not given.
+ - Since the entire string should be processed as is, rather than cut into blocks of training data, block_size is not specified.
+ - dec_create_masks is set to False, as there will be no padding in the training data.
+
+ ```python
+ proc = rl.DataProcessor(dec_tokenizer=tok)
+
+ # read training .txt file
+ with open("shakespeare_dialogues.txt", "r") as file:
+     dialogues_str = file.read()
+
+ # split string into train and validation sets
+ split = 0.9
+ n = int(len(dialogues_str) * split)
+ train_data = dialogues_str[:n]
+ val_data = dialogues_str[n:]
+
+ # process and save training data as data/shakespeare_train*.pt
+ proc.process_list(
+     save_path="data/shakespeare_train",
+     dec_data=train_data,
+     dec_create_masks=False
+ )
+
+ # process and save validation data as data/shakespeare_valid*.pt
+ proc.process_list(
+     save_path="data/shakespeare_valid",
+     dec_data=val_data,
+     dec_create_masks=False
+ )
+ ```
+ - Training the transformer:
+ ```python
+ import robo_lib as rl
+
+ tok = rl.load_component("tokenizers/shakespeare_tok.pkl")
+
+ robo = rl.RoboConstructor(
+     n_embed=1024,
+     dec_n_blocks=8,
+     dec_n_head=8,
+     dec_vocab_size=tok.vocab_size,
+     dec_block_size=200
+ )
+
+ robo.train(
+     max_iters=20000,
+     eval_interval=200,
+     batch_size=64,
+     dec_training_path="data/shakespeare_train_decoder_data.pt",
+     dec_eval_path="data/shakespeare_valid_decoder_data.pt",
+     dec_tokenizer=tok,
+     save_path="models/shakespeare_robo.pkl"
+ )
+ ```
+ - In this example, the user can specify the start of the generated Shakespeare play, and the transformer will generate and print the rest, until `max_new_tokens` (1000 here) tokens have been generated.
+ - Temperature and top_k are set to 1.2 and 2 respectively to generate a more "creative" output.
+ ```python
+ import robo_lib as rl
+
+ robo = rl.load_component("models/shakespeare_robo.pkl")
+ tok = rl.load_component("tokenizers/shakespeare_tok.pkl")
+
+ while True:
+     start = input()
+     print(robo.generate(start, max_new_tokens=1000, dec_tokenizer=tok, temperature=1.2, top_k=2))
+ ```
robo_lib-0.0.5/README.md ADDED
@@ -0,0 +1,226 @@
(content identical to the README text embedded in PKG-INFO above)
@@ -4,11 +4,11 @@ build-backend = "hatchling.build"
 
  [project]
  name = "robo_lib"
- version = "0.0.4"
+ version = "0.0.5"
  authors = [
    { name="Erik Papp", email="erik3papp@gmail.com" },
  ]
- description = "A package to configure, create and train transformer models."
+ description = "A package to create, configure, and train transformer models."
  readme = "README.md"
  requires-python = ">=3.8"
  dependencies = ["torch", "tokenizers", "numpy"]
@@ -202,8 +202,8 @@ class DataProcessor:
      dec_create_masks:bool=True,
      dec_block_size_exceeded_policy:str=None,
      enc_data:list[str]=None,
-     enc_create_masks=True,
      enc_max_block_size:int=None,
+     enc_create_masks:bool=True,
      enc_block_size_exceeded_policy:str=None
  ) -> None:
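Note that `enc_create_masks` moved after `enc_max_block_size` in this signature, so 0.0.4 code passing these arguments positionally would now bind them to the wrong parameters. Keyword calls, as in the README examples, are unaffected; a hedged sketch (the data variables are hypothetical):

```python
proc.process_list(
    save_path="data/training",
    dec_data=target_texts,    # hypothetical list[str]
    enc_data=source_texts,    # hypothetical list[str]
    enc_max_block_size=100,
    enc_create_masks=True     # keyword arguments survive the reorder
)
```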
@@ -646,7 +646,14 @@ class RoboConstructor(nn.Module):
      self.encoder_blocks = MySequential(*[EncoderBlock(n_embed, enc_n_head, enc_expansion_factor, dropout=dropout) for _ in range(enc_n_blocks)])
  else:
      self.cross_attention = False
+     self.enc_n_blocks = None
+     self.enc_n_head = None
+     self.enc_expansion_factor = None
+     self.enc_vocab_size = None
      self.enc_block_size = None
+     self.enc_token_embedding_table = None
+     self.enc_positional_embedding_table = None
+     self.encoder_blocks = None
 
      self.decoder_blocks = MySequential(*[DecoderBlock(n_embed, dec_n_head, dec_expansion_factor, cross_attention=self.cross_attention, block_size=self.dec_block_size, dropout=dropout) for _ in range(dec_n_blocks)])
      self.ln = nn.LayerNorm(n_embed)
@@ -734,7 +741,7 @@ class RoboConstructor(nn.Module):
      eval_iters:int=3,
      learning_rate:float=1e-4,
      pad_token:int=None,
-     tokenizer:TokenizerConstructor=None,
+     dec_tokenizer:TokenizerConstructor=None,
      save_path:str=None,
      label_smoothing:float=0.1
  ) -> None:
@@ -748,8 +755,8 @@ class RoboConstructor(nn.Module):
  enc_training_masks_data = torch.load(enc_training_masks_path, weights_only=True) if enc_training_masks_path != None else None
  enc_eval_masks_data = torch.load(enc_eval_masks_path, weights_only=True) if enc_eval_masks_path != None else None
 
- if pad_token == None and tokenizer != None:
-     pad_token = tokenizer.pad_token
+ if pad_token == None and dec_tokenizer != None:
+     pad_token = dec_tokenizer.pad_token
 
  self.to(self.device)
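Because `tokenizer` was renamed to `dec_tokenizer`, a 0.0.4-style `robo.train(..., tokenizer=tok)` call no longer works, and the pad token is now read from `dec_tokenizer` when `pad_token` is not given. A hedged sketch of the 0.0.5 call (paths and tokenizer reuse the README example above; the commented pad token ID is a made-up placeholder):

```python
robo.train(
    max_iters=20000,
    eval_interval=200,
    batch_size=128,
    dec_training_path="data/training_decoder_data.pt",
    dec_eval_path="data/validation_decoder_data.pt",
    dec_tokenizer=decoder_tok,   # was tokenizer= in 0.0.4
    save_path="models/eng_to_fr_robo.pkl"
)

# for a tokenizer without a .pad_token attribute, pass the ID directly instead:
# robo.train(..., pad_token=3)   # 3 is a hypothetical pad token id
```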
@@ -797,7 +804,6 @@ class RoboConstructor(nn.Module):
 
      self.eval()
 
- # use dec and enc tokenizers
  def generate(self,
      inputs:list[int]|str,
      max_new_tokens:int=None,
@@ -805,8 +811,8 @@ class RoboConstructor(nn.Module):
      enc_tokenizer:TokenizerConstructor=None,
      dec_start_token:int=None,
      enc_start_token:int=None,
-     enc_end_token:int=None,
      dec_end_token:int=None,
+     enc_end_token:int=None,
      separator_token:int=None,
      new_line_token:int=None,
      temperature:float=1,
robo_lib-0.0.4/PKG-INFO DELETED
@@ -1,18 +0,0 @@
- Metadata-Version: 2.3
- Name: robo_lib
- Version: 0.0.4
- Summary: A package to configure, create and train transformer models.
- Project-URL: Homepage, https://github.com/hamburgerfish/robo_pack
- Project-URL: Issues, https://github.com/hamburgerfish/robo_pack/issues
- Author-email: Erik Papp <erik3papp@gmail.com>
- License-File: LICENSE
- Classifier: License :: OSI Approved :: MIT License
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3
- Requires-Python: >=3.8
- Requires-Dist: numpy
- Requires-Dist: tokenizers
- Requires-Dist: torch
- Description-Content-Type: text/markdown
-
- # robo_pack
robo_lib-0.0.4/README.md DELETED
@@ -1 +0,0 @@
- # robo_pack