mmgp 3.2.1.tar.gz → 3.2.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of mmgp might be problematic.

@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: mmgp
- Version: 3.2.1
+ Version: 3.2.3
  Summary: Memory Management for the GPU Poor
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
  License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft


  <p align="center">
- <H2>Memory Management 3.2.0 for the GPU Poor by DeepBeepMeep</H2>
+ <H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
  </p>


@@ -44,6 +44,9 @@ Each profile may use a combination of the following:

  ## Sample applications that use mmgp
  It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
+ - Wan2GP: https://github.com/deepbeepmeep/Wan2GP :\
+ An excellent text to video and image to video generator by Alibaba
+
  - Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP :\
  A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM

@@ -116,9 +119,9 @@ For example:
  - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
  - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
  - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
- - budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
+ - budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
  The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
- - workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+ - workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
  - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
  - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
  - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
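Editor's note: the budgets and workingVRAM options documented above gain a new form in 3.2.3, a string such as "70%" expressing a share of total VRAM, alongside the existing megabyte numbers and per-model dictionaries. A minimal usage sketch, assuming a diffusers-style pipe with illustrative model ids (offload.all and the keyword names come from the documentation above):

import torch
from mmgp import offload

# `pipe` is assumed: a pipeline or dict of models whose ids include
# "transformer" (the heavy video/image generator) among others.
offload.all(
    pipe,
    pinnedMemory=["transformer"],    # pin only the largest model to RAM
    asyncTransfers=True,             # preload the next part during compute
    budgets={"transformer": "70%",   # new in 3.2.3: percent of total VRAM
             "*": 3000},             # default budget for other models, in MB
    workingVRAM="20%",               # keep VRAM free for the processed data
)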
@@ -1,6 +1,6 @@

  <p align="center">
- <H2>Memory Management 3.2.0 for the GPU Poor by DeepBeepMeep</H2>
+ <H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
  </p>


@@ -27,6 +27,9 @@ Each profile may use a combination of the following:

  ## Sample applications that use mmgp
  It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
+ - Wan2GP: https://github.com/deepbeepmeep/Wan2GP :\
+ An excellent text to video and image to video generator by Alibaba
+
  - Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP :\
  A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM

@@ -99,9 +102,9 @@ For example:
  - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
  - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
  - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
- - budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
+ - budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
  The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
- - workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+ - workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
  - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
  - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
  - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
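Editor's note: the asyncTransfers option above amounts to double buffering, while the GPU computes on one model part, the next part is copied over on a separate CUDA stream from pinned RAM. A generic PyTorch sketch of that technique, not mmgp's actual implementation (all names are illustrative, and a CUDA device is assumed):

import torch

transfer_stream = torch.cuda.Stream()

def prefetch(block_cpu):
    # Pinned host memory lets this host-to-device copy run asynchronously.
    with torch.cuda.stream(transfer_stream):
        return block_cpu.to("cuda", non_blocking=True)

blocks = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
next_gpu = prefetch(blocks[0])
for i in range(len(blocks)):
    # Make the compute stream wait until the pending copy has landed.
    torch.cuda.current_stream().wait_stream(transfer_stream)
    current = next_gpu
    if i + 1 < len(blocks):
        next_gpu = prefetch(blocks[i + 1])  # overlaps with the compute below
    current = current @ current  # stand-in for the real forward pass
torch.cuda.synchronize()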
@@ -1,6 +1,6 @@
  [project]
  name = "mmgp"
- version = "3.2.1"
+ version = "3.2.3"
  authors = [
      { name = "deepbeepmeep", email = "deepbeepmeep@yahoo.com" },
  ]
@@ -1,4 +1,4 @@
- # ------------------ Memory Management 3.2.1 for the GPU Poor by DeepBeepMeep (mmgp)------------------
+ # ------------------ Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep (mmgp)------------------
  #
  # This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
  # This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
@@ -479,7 +479,7 @@ def _welcome():
      if welcome_displayed:
          return
      welcome_displayed = True
-     print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.2.1) by DeepBeepMeep ************{ENDC}{UNBOLD}")
+     print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.2.3) by DeepBeepMeep ************{ENDC}{UNBOLD}")

  def _extract_num_from_str(num_in_str):
      size = len(num_in_str)
@@ -1457,11 +1457,10 @@ class offload:
              if tied_param != None:
                  setattr( tied_param[0], tied_param[1], q)
              del p, q
-         any_past_block = False

          loaded_block = self.loaded_blocks[model_id]
+
          if not preload and loaded_block != None:
-             any_past_block = True
              self.gpu_unload_blocks(model_id, loaded_block)
              if self.ready_to_check_mem():
                  self.empty_cache_if_needed()
@@ -1475,7 +1474,8 @@ class offload:


          if self.async_transfers and blocks_name != None:
-             first = self.prev_blocks_names[entry_name] == None or not any_past_block
+             prev = self.prev_blocks_names[entry_name]
+             first = prev == None or prev != loaded_block
              next_blocks_entry = self.next_blocks_names[entry_name] if entry_name in self.next_blocks_names else None
              if first:
                  if self.verboseLevel >=2:
@@ -1497,7 +1497,6 @@ class offload:
                      print(f"Loading model {entry_name} ({model_name}) in GPU")
                  cpu_to_gpu(self.default_stream, self.blocks_of_modules[entry_name])
                  torch.cuda.synchronize()
-
          if not preload:
              self.loaded_blocks[model_id] = blocks_name

@@ -1710,7 +1709,7 @@ class offload:
          current_budget -= base_size
          if current_budget <= 0:
              if self.verboseLevel >=1:
-                 print(f"Async loading plan for model '{model_id}' : due to limited budget, beside the async shuttle only only base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
+                 print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
              return

          towers = []
@@ -1732,7 +1731,7 @@ class offload:
          current_budget -= 2 * max_floor_size
          if current_budget <= 0:
              if self.verboseLevel >=1:
-                 print(f"Async loading plan for model '{model_id}' : due to limited budget, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
+                 print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
              return


@@ -1743,7 +1742,7 @@ class offload:
              max_blocks_fetch = max(max_floor_size, max_blocks_fetch)
          if preload_blocks_count <= 0:
              if self.verboseLevel >=1:
-                 print(f"Async loading plan for model '{model_id}' : due to limited budget, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
+                 print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
              return

          nb_blocks= len(floors)
@@ -1821,16 +1820,20 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru

      windows_os = os.name == 'nt'

+     def get_parsed_budget(b):
+         if isinstance(b , str) and b.endswith("%"):
+             return float(b[:-1]) * self.device_mem_capacity
+         else:
+             return b * ONE_MB
+
      budget = 0
      if not budgets is None:
          if isinstance(budgets , dict):
-             model_budgets = budgets
-             budget = budgets.get("*", 0) * ONE_MB
+             model_budgets = { k : get_parsed_budget(b) for k , b in budgets.items() }
+             budget = model_budgets.get("*", 0)
          else:
-             budget = int(budgets) * ONE_MB
+             budget = get_parsed_budget(budget)

-     # if (budgets!= None or budget >0) :
-     #     self.async_transfers = True
      self.async_transfers = asyncTransfers


@@ -1938,18 +1941,19 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
              estimatesBytesToPin += current_model_size


-         model_budget = model_budgets[model_id] * ONE_MB if model_id in model_budgets else budget
+         model_budget = model_budgets[model_id] if model_id in model_budgets else budget
          if workingVRAM != None:
              model_minimumVRAM = -1
              if isinstance(workingVRAM, dict):
                  if model_id in workingVRAM:
-                     model_minimumVRAM = workingVRAM[model_id]
+                     model_minimumVRAM = get_parsed_budget(workingVRAM[model_id])
                  elif "*" in model_id in workingVRAM:
-                     model_minimumVRAM = workingVRAM["*"]
+                     model_minimumVRAM = get_parsed_budget(workingVRAM["*"])
              else:
-                 model_minimumVRAM = workingVRAM
+                 model_minimumVRAM = get_parsed_budget(workingVRAM)
+
              if model_minimumVRAM > 0:
-                 new_budget = self.device_mem_capacity - model_minimumVRAM * ONE_MB
+                 new_budget = self.device_mem_capacity - model_minimumVRAM
                  new_budget = 1 if new_budget < 0 else new_budget
                  model_budget = new_budget if model_budget == 0 or new_budget < model_budget else model_budget
          if model_budget > 0 and model_budget > current_model_size:
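Editor's note: the get_parsed_budget helper added above is what turns the new percentage strings into byte budgets. Below is a standalone sketch of the rule (parse_budget is a hypothetical name, not mmgp's API); it assumes the conventional reading where "50%" means half the card's capacity, whereas the released helper multiplies the numeric part by device_mem_capacity without dividing by 100:

ONE_MB = 1024 * 1024

def parse_budget(value, device_capacity_bytes):
    # "50%" -> fraction of total VRAM; a plain number -> megabytes.
    if isinstance(value, str) and value.endswith("%"):
        return float(value[:-1]) / 100 * device_capacity_bytes
    return value * ONE_MB

# Example on a 24 GB card:
capacity = 24 * 1024 * ONE_MB
assert parse_budget("50%", capacity) == capacity / 2
assert parse_budget(3000, capacity) == 3000 * ONE_MB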
@@ -1,6 +1,6 @@
  Metadata-Version: 2.2
  Name: mmgp
- Version: 3.2.1
+ Version: 3.2.3
  Summary: Memory Management for the GPU Poor
  Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
  License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft


  <p align="center">
- <H2>Memory Management 3.2.0 for the GPU Poor by DeepBeepMeep</H2>
+ <H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
  </p>


@@ -44,6 +44,9 @@ Each profile may use a combination of the following:

  ## Sample applications that use mmgp
  It is recommended to have a look at these applications to see how mmgp was implemented in each of them:
+ - Wan2GP: https://github.com/deepbeepmeep/Wan2GP :\
+ An excellent text to video and image to video generator by Alibaba
+
  - Hunyuan3D-2GP: https://github.com/deepbeepmeep/Hunyuan3D-2GP :\
  A great image to 3D and text to 3D tool by the Tencent team. Thanks to mmgp it can run with less than 6 GB of VRAM

@@ -116,9 +119,9 @@ For example:
  - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
  - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
  - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
- - budgets: either a number in mega bytes (for all models, if 0 unlimited budget) or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
+ - budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
  The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
- - workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+ - workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
  - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
  - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
  - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
6 files without changes