mmgp 2.0.0.tar.gz → 2.0.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of mmgp might be problematic.

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mmgp
-Version: 2.0.0
+Version: 2.0.2
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -92,13 +92,13 @@ Load the tensors data of a model in RAM of a model already initialized with no d
 
 - *fast_load_transformers_model(model_path: str)*\
 Initialize (build the model hierarchy in memory) and fast load the corresponding tensors of a 'transformers' library model.
-The advantages over the original *LoadfromPretrained* function is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
+The advantages over the original *from_pretrained* method is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
 Please note that you need to keep the original file transformers 'config.json' in the same directory.
 
 
 The typical workflow wil be:
 1) temporarly insert the *save_model* function just after a model has been fully loaded to save a copy of the model / quantized model.
-2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a 'Loadfrompretrained' call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
+2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
 
 ## Special cases
 Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
@@ -78,13 +78,13 @@ Load the tensors data of a model in RAM of a model already initialized with no d
 
 - *fast_load_transformers_model(model_path: str)*\
 Initialize (build the model hierarchy in memory) and fast load the corresponding tensors of a 'transformers' library model.
-The advantages over the original *LoadfromPretrained* function is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
+The advantages over the original *from_pretrained* method is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
 Please note that you need to keep the original file transformers 'config.json' in the same directory.
 
 
 The typical workflow wil be:
 1) temporarly insert the *save_model* function just after a model has been fully loaded to save a copy of the model / quantized model.
-2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a 'Loadfrompretrained' call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
+2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
 
 ## Special cases
 Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
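
The workflow changes highlighted in these README hunks (save a copy of the fully loaded model once, then fast-load that file instead of calling *from_pretrained*) could look roughly like the sketch below. Only *fast_load_transformers_model(model_path: str)* is documented in the excerpt above; the `from mmgp import offload` import path, the `save_model` signature, and the model id are assumptions for illustration.

```python
# Hypothetical sketch of the two-step workflow described in the README excerpt.
# Anything other than fast_load_transformers_model / save_model / from_pretrained
# is an assumption, not confirmed by this diff.
from transformers import AutoModelForCausalLM
from mmgp import offload  # assumed import path for the mmgp helpers

# Step 1 (run once): load the model the usual way, then save a single-file copy
# (optionally quantized) under a filename of your choosing.
model = AutoModelForCausalLM.from_pretrained("some/model-id")      # placeholder id
offload.save_model(model, "model_quantized.safetensors")           # assumed signature

# Step 2 (from then on): replace the from_pretrained call with the fast loader.
# The original transformers 'config.json' must stay in the same directory.
model = offload.fast_load_transformers_model("model_quantized.safetensors")
```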
@@ -1,6 +1,6 @@
 [project]
 name = "mmgp"
-version = "2.0.0"
+version = "2.0.2"
 authors = [
   { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
 ]
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: mmgp
-Version: 2.0.0
+Version: 2.0.2
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -92,13 +92,13 @@ Load the tensors data of a model in RAM of a model already initialized with no d
 
 - *fast_load_transformers_model(model_path: str)*\
 Initialize (build the model hierarchy in memory) and fast load the corresponding tensors of a 'transformers' library model.
-The advantages over the original *LoadfromPretrained* function is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
+The advantages over the original *from_pretrained* method is that the full model can fit into a single file with a filename of your choosing (thefore you can have multiple 'transformers' versions of the same model in the same directory) and prequantized model are processed in a transparent way.
 Please note that you need to keep the original file transformers 'config.json' in the same directory.
 
 
 The typical workflow wil be:
 1) temporarly insert the *save_model* function just after a model has been fully loaded to save a copy of the model / quantized model.
-2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a 'Loadfrompretrained' call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
+2) replace the full initalizing / loading logic with *fast_load_transformers_model* (if there is a *from_pretrained* call to a transformers object) or only the tensor loading functions (*torch.load_model_file* and *torch.load_state_dict*) with *load_model_data after* the initializing logic.
 
 ## Special cases
 Sometime there isn't an explicit pipe object as each submodel is loaded separately in the main app. If this is the case, you need to create a dictionary that manually maps all the models.\
@@ -680,9 +680,9 @@ class offload:
 if (budgets!= None or budget >0) :
     self.async_transfers = True
 
-#pinInRAM = True
+pinInRAM = True
 # compile not working yet or slower
-compile = False
+compile = False # True
 #quantizeTransformer = False
 #self.async_transfers = False
 self.compile = compile
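
For context on the `pinInRAM` / `async_transfers` flags toggled in the hunk above: pinned (page-locked) host memory is what allows host-to-device copies to run asynchronously and overlap with GPU compute. A minimal, generic PyTorch illustration of that mechanism (not mmgp's code) follows.

```python
import torch

# Generic illustration: a tensor pinned in host RAM can be copied to the GPU
# with non_blocking=True, letting the transfer overlap with other queued work.
weights = torch.randn(1024, 1024, dtype=torch.bfloat16).pin_memory()

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        gpu_weights = weights.to("cuda", non_blocking=True)
    # Make the default stream wait until the async copy has finished
    # before it reads gpu_weights.
    torch.cuda.current_stream().wait_stream(copy_stream)
```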
@@ -804,9 +804,6 @@ class offload:
 p._data = p._data.pin_memory()
 # fix quanto bug (that seems to have been fixed since&) that allows _scale to be float32 if the original weight was float32
 # (this may cause type mismatch between dequantified bfloat16 weights and float32 scales)
-if p._scale.dtype == torch.float32:
-    pass
-
 p._scale = p._scale.to(torch.bfloat16).pin_memory() if p._scale.dtype == torch.float32 else p._scale.pin_memory()
 pinned_parameters_data[p]=[p._data, p._scale]
 else:
@@ -872,13 +869,13 @@ class offload:
 # we limit this check to the first level of blocks as quering the cuda cache is time consuming
 self.hook_me_light(submodule, model_id, cur_blocks_name, submodule_method, context = submodule_name)
 
-# if compile and cur_blocks_name != None and model_id == "transformer" and "_blocks" in submodule_name:
-#     submodule.compile(mode="reduce-overhead" ) #mode= "max-autotune"
+if compile and cur_blocks_name != None and model_id == "transformer" and "_blocks" in submodule_name:
+    submodule.compile(mode="reduce-overhead" ) #mode= "max-autotune"
 
 current_size = self.add_module_to_blocks(model_id, cur_blocks_name, submodule, prev_blocks_name)
 
 
-if compile:
+if compile and False:
     if verboseLevel>=1:
         print("Torch compilation started")
     torch._dynamo.config.cache_size_limit = 10000
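
The re-enabled per-block compilation in the hunk above uses PyTorch 2's module-level `compile`, which wraps a submodule's forward with `torch.compile` in place. A standalone sketch of the same pattern (unrelated to mmgp's module hierarchy, and requiring a recent PyTorch 2.x) is shown below; `Block` and the selection loop are illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Standalone illustration of the pattern in the diff above: compile only
# selected submodules in place with mode="reduce-overhead".
class Block(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

model = nn.Sequential(*[Block() for _ in range(4)])

for name, submodule in model.named_children():
    # The diff only compiles "transformer" blocks; here every Block is compiled.
    submodule.compile(mode="reduce-overhead")

out = model(torch.randn(8, 256))  # first call triggers compilation
```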
@@ -943,7 +940,7 @@ class offload:
     info = "You have chosen a Medium speed profile that requires at least 32 GB of RAM and 24 GB of VRAM."
     return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , info = info, quantizeTransformer= quantizeTransformer)
 elif profile_no == profile_type.LowRAM_LowVRAM_Slow:
-    info = "You have chosen the Slowest profile that requires at least 32 GB of RAM and 12 GB of VRAM."
+    info = "You have chosen the Slow profile that requires at least 32 GB of RAM and 12 GB of VRAM."
     return offload.all(pipe_or_dict_of_modules, pinInRAM= "transformer", modelsToQuantize= extra_mod_to_quantize , budgets=budgets, info = info, quantizeTransformer= quantizeTransformer)
 elif profile_no == profile_type.VerylowRAM_LowVRAM_Slowest:
     budgets["transformer"] = 400
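
The hunk above only retouches the info string inside the profile dispatcher; the branches themselves still forward `pipe_or_dict_of_modules` to `offload.all`. A hedged sketch of how such a dispatcher would typically be called is below; the `offload.profile` entry point, the import path, and the placeholder modules are assumptions inferred from the code shown, not confirmed by this diff.

```python
import torch.nn as nn

# Hypothetical usage of the profile dispatcher shown above; the entry point
# name (offload.profile) and the import path are assumptions.
from mmgp import offload, profile_type  # assumed import path

# Per the README excerpt, a plain dict mapping names to submodels can replace
# a pipe object; "transformer" matches the pinInRAM / budgets keys visible in
# the dispatcher. The Linear layers are placeholders for real submodels.
models = {"transformer": nn.Linear(8, 8), "text_encoder": nn.Linear(8, 8)}

offload.profile(models, profile_no=profile_type.LowRAM_LowVRAM_Slow)
```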
4 files without changes