mmgp 3.2.2__tar.gz → 3.2.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Note: this version of mmgp has been flagged as potentially problematic.
- {mmgp-3.2.2/src/mmgp.egg-info → mmgp-3.2.3}/PKG-INFO +4 -4
- {mmgp-3.2.2 → mmgp-3.2.3}/README.md +3 -3
- {mmgp-3.2.2 → mmgp-3.2.3}/pyproject.toml +1 -1
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp/offload.py +23 -19
- {mmgp-3.2.2 → mmgp-3.2.3/src/mmgp.egg-info}/PKG-INFO +4 -4
- {mmgp-3.2.2 → mmgp-3.2.3}/LICENSE.md +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/setup.cfg +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/__init__.py +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp/__init__.py +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp/safetensors2.py +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp.egg-info/SOURCES.txt +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp.egg-info/dependency_links.txt +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp.egg-info/requires.txt +0 -0
- {mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp.egg-info/top_level.txt +0 -0
{mmgp-3.2.2/src/mmgp.egg-info → mmgp-3.2.3}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: mmgp
-Version: 3.2.2
+Version: 3.2.3
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
 
 
 <p align="center">
-<H2>Memory Management 3.2.2 for the GPU Poor by DeepBeepMeep</H2>
+<H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
 </p>
 
 
@@ -119,9 +119,9 @@ For example:
 - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
 - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
 - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
-- budgets: either a number in mega bytes
+- budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
 The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
-- workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+- workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
 - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
 - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
 - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
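The budgets / workingVRAM notation described above (plain megabyte numbers, "NN%" strings relative to total VRAM, and dicts with a "*" default entry) can be sketched as a standalone resolver. This is a hypothetical re-implementation of the documented semantics, not the mmgp source; `parse_budget` and `resolve_budgets` are illustrative names, and the total VRAM capacity is passed in explicitly.

```python
ONE_MB = 1024 * 1024

def parse_budget(spec, vram_capacity):
    """Convert one budget value (MB number or 'NN%' string) to bytes."""
    if isinstance(spec, str) and spec.endswith("%"):
        return float(spec[:-1]) / 100 * vram_capacity
    return spec * ONE_MB

def resolve_budgets(budgets, model_ids, vram_capacity):
    """Return a per-model budget in bytes, honoring the '*' default entry."""
    if isinstance(budgets, dict):
        default = parse_budget(budgets["*"], vram_capacity) if "*" in budgets else 0
        return {m: parse_budget(budgets[m], vram_capacity) if m in budgets else default
                for m in model_ids}
    flat = parse_budget(budgets, vram_capacity)  # one budget for all models
    return {m: flat for m in model_ids}
```

For example, `resolve_budgets({"transformer": 3000, "*": "25%"}, ["transformer", "text_encoder"], capacity)` would give the transformer a fixed 3000 MB while every other model falls back to a quarter of total VRAM.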
{mmgp-3.2.2 → mmgp-3.2.3}/README.md
@@ -1,6 +1,6 @@
 
 <p align="center">
-<H2>Memory Management 3.2.2 for the GPU Poor by DeepBeepMeep</H2>
+<H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
 </p>
 
 
@@ -102,9 +102,9 @@ For example:
 - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
 - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
 - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
-- budgets: either a number in mega bytes
+- budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
 The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
-- workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+- workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
 - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
 - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
 - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows
{mmgp-3.2.2 → mmgp-3.2.3}/src/mmgp/offload.py
@@ -1,4 +1,4 @@
-# ------------------ Memory Management 3.2.2 for the GPU Poor by DeepBeepMeep (mmgp)------------------
+# ------------------ Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep (mmgp)------------------
 #
 # This module contains multiples optimisations so that models such as Flux (and derived), Mochi, CogView, HunyuanVideo, ... can run smoothly on a 24 GB GPU limited card.
 # This a replacement for the accelerate library that should in theory manage offloading, but doesn't work properly with models that are loaded / unloaded several
@@ -479,7 +479,7 @@ def _welcome():
     if welcome_displayed:
         return
     welcome_displayed = True
-    print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.2.2) by DeepBeepMeep ************{ENDC}{UNBOLD}")
+    print(f"{BOLD}{HEADER}************ Memory Management for the GPU Poor (mmgp 3.2.3) by DeepBeepMeep ************{ENDC}{UNBOLD}")
 
 def _extract_num_from_str(num_in_str):
     size = len(num_in_str)
@@ -1457,11 +1457,10 @@ class offload:
         if tied_param != None:
             setattr( tied_param[0], tied_param[1], q)
         del p, q
-        any_past_block = False
 
         loaded_block = self.loaded_blocks[model_id]
+
         if not preload and loaded_block != None:
-            any_past_block = True
             self.gpu_unload_blocks(model_id, loaded_block)
             if self.ready_to_check_mem():
                 self.empty_cache_if_needed()
@@ -1475,7 +1474,8 @@ class offload:
 
 
         if self.async_transfers and blocks_name != None:
-
+            prev = self.prev_blocks_names[entry_name]
+            first = prev == None or prev != loaded_block
             next_blocks_entry = self.next_blocks_names[entry_name] if entry_name in self.next_blocks_names else None
             if first:
                 if self.verboseLevel >=2:
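The prev/first bookkeeping in the hunk above drives the asyncTransfers double-buffering the README describes: while one model part computes, the next part is already being copied in, so only the first transfer is fully exposed. A toy cost model (hypothetical helper names, not mmgp code) illustrates why this can pay off:

```python
def serial_time(n_blocks, t_load, t_compute):
    """Total time when each block is loaded, then computed, one after another."""
    return n_blocks * (t_load + t_compute)

def overlapped_time(n_blocks, t_load, t_compute):
    """Total time when the next block's load overlaps the current compute:
    only the first load is exposed; each later step costs the slower of the two."""
    return t_load + (n_blocks - 1) * max(t_load, t_compute) + t_compute
```

With 4 blocks, a 2-unit load and a 3-unit compute, the serial schedule costs 20 units while the overlapped one costs 14, which matches the README's note that the gain is modest but real, and that the scheme needs room for two blocks in VRAM at once.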
@@ -1497,7 +1497,6 @@ class offload:
                 print(f"Loading model {entry_name} ({model_name}) in GPU")
             cpu_to_gpu(self.default_stream, self.blocks_of_modules[entry_name])
             torch.cuda.synchronize()
-
         if not preload:
             self.loaded_blocks[model_id] = blocks_name
 
@@ -1710,7 +1709,7 @@ class offload:
         current_budget -= base_size
         if current_budget <= 0:
             if self.verboseLevel >=1:
-                print(f"Async loading plan for model '{model_id}' :
+                print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
             return
 
         towers = []
@@ -1732,7 +1731,7 @@ class offload:
         current_budget -= 2 * max_floor_size
         if current_budget <= 0:
             if self.verboseLevel >=1:
-                print(f"Async loading plan for model '{model_id}' :
+                print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
             return
 
 
@@ -1743,7 +1742,7 @@ class offload:
         max_blocks_fetch = max(max_floor_size, max_blocks_fetch)
         if preload_blocks_count <= 0:
             if self.verboseLevel >=1:
-                print(f"Async loading plan for model '{model_id}' :
+                print(f"Async loading plan for model '{model_id}' : minimum budget management, beside the async shuttle only the base model ({(base_size)/ONE_MB:0.2f} MB) will be preloaded")
             return
 
         nb_blocks= len(floors)
@@ -1821,16 +1820,20 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
 
     windows_os = os.name == 'nt'
 
+    def get_parsed_budget(b):
+        if isinstance(b , str) and b.endswith("%"):
+            return float(b[:-1]) * self.device_mem_capacity
+        else:
+            return b * ONE_MB
+
     budget = 0
     if not budgets is None:
         if isinstance(budgets , dict):
-            model_budgets = budgets
-            budget =
+            model_budgets = { k : get_parsed_budget(b) for k , b in budgets.items() }
+            budget = model_budgets.get("*", 0)
         else:
-            budget =
+            budget = get_parsed_budget(budget)
 
-    # if (budgets!= None or budget >0) :
-    #     self.async_transfers = True
     self.async_transfers = asyncTransfers
 
 
@@ -1938,18 +1941,19 @@ def all(pipe_or_dict_of_modules, pinnedMemory = False, quantizeTransformer = Tru
             estimatesBytesToPin += current_model_size
 
 
-        model_budget = model_budgets[model_id]
+        model_budget = model_budgets[model_id] if model_id in model_budgets else budget
         if workingVRAM != None:
             model_minimumVRAM = -1
             if isinstance(workingVRAM, dict):
                 if model_id in workingVRAM:
-                    model_minimumVRAM = workingVRAM[model_id]
+                    model_minimumVRAM = get_parsed_budget(workingVRAM[model_id])
                 elif "*" in model_id in workingVRAM:
-                    model_minimumVRAM = workingVRAM["*"]
+                    model_minimumVRAM = get_parsed_budget(workingVRAM["*"])
             else:
-                model_minimumVRAM = workingVRAM
+                model_minimumVRAM = get_parsed_budget(workingVRAM)
+
             if model_minimumVRAM > 0:
-                new_budget = self.device_mem_capacity - model_minimumVRAM
+                new_budget = self.device_mem_capacity - model_minimumVRAM
                 new_budget = 1 if new_budget < 0 else new_budget
                 model_budget = new_budget if model_budget == 0 or new_budget < model_budget else model_budget
         if model_budget > 0 and model_budget > current_model_size:
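The workingVRAM handling in the hunk above caps a model's budget so at least a minimum amount of VRAM stays free for the data being processed, and the smaller of the two limits wins. A minimal standalone sketch of that clamping rule (hypothetical `cap_budget`, not the mmgp source; all values in the same unit):

```python
def cap_budget(model_budget, minimum_working_vram, device_mem_capacity):
    """Shrink model_budget so at least minimum_working_vram stays free."""
    if minimum_working_vram <= 0:
        return model_budget
    new_budget = device_mem_capacity - minimum_working_vram
    new_budget = 1 if new_budget < 0 else new_budget  # never drop to zero
    # a budget of 0 means "unlimited", so any positive cap replaces it;
    # otherwise keep whichever limit is tighter
    return new_budget if model_budget == 0 or new_budget < model_budget else model_budget
```

So with 24 units of VRAM and a 4-unit working reserve, an unlimited budget (0) becomes 20, an oversized budget of 30 is clamped to 20, and a budget of 10 is left alone.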
{mmgp-3.2.2 → mmgp-3.2.3/src/mmgp.egg-info}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: mmgp
-Version: 3.2.2
+Version: 3.2.3
 Summary: Memory Management for the GPU Poor
 Author-email: deepbeepmeep <deepbeepmeep@yahoo.com>
 License: GNU GENERAL PUBLIC LICENSE
@@ -17,7 +17,7 @@ Requires-Dist: peft
 
 
 <p align="center">
-<H2>Memory Management 3.2.2 for the GPU Poor by DeepBeepMeep</H2>
+<H2>Memory Management 3.2.3 for the GPU Poor by DeepBeepMeep</H2>
 </p>
 
 
@@ -119,9 +119,9 @@ For example:
 - pinnedMemory: Boolean (for all models) or List of models ids to pin to RAM. Every model pinned to RAM will load much faster (up to 2 times) but this requires more RAM
 - quantizeTransformer: boolean by default True. The 'transformer' model in the pipe contains usually the video or image generator is by defaut; quantized on the fly by default to 8 bits. If you want to save time on disk and reduce the loading time, you may want to load directly a prequantized model. If you don't want to quantize the image generator, you need to set the option *quantizeTransformer* to *False* to turn off on the fly quantization.
 - extraModelsToQuantize: list of additional modelids of models to quantize on the fly. If the corresponding model is already quantized, this option will be ignored.
-- budgets: either a number in mega bytes
+- budgets: either a number in mega bytes, (for all models, if 0 unlimited budget) a string that is perecentage of the total VRAM or a dictionary that maps model ids to mega bytes : define the approximate budget in mega bytes that is allocated in VRAM for a model. Try not to allocate all the available VRAM so that the rest can be used to process the data. To define the default value in the dictionary, you may add entry named "*".
 The smaller this number, the more VRAM left for image data / longer video but also the slower because there will be lots of loading / unloading between the RAM and the VRAM. If model is too big to fit in a budget, it will be broken down in multiples parts that will be unloaded / loaded consequently. The speed of low budget can be increased (up to 2 times) by turning on the options pinnedMemory and asyncTransfers.
-- workingVRAM: either a number in mega bytes or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
+- workingVRAM: either a number in mega bytes, a string that is perecentage of the total VRAM or a dictionary that maps a model ids to a number in mega bytes that corresponds to a minimum amount of VRAM that should be left for the data processed by the model. This number will prevail if it is in conflict with a too high budget defined for the same model.
 - asyncTransfers: boolean, load to the GPU the next model part while the current part is being processed. This requires twice the budget if any is defined. This may increase speed by 20% (mostly visible on fast modern GPUs).
 - verboseLevel: number between 0 and 2 (1 by default), provides various level of feedback of the different processes
 - compile: list of model ids to compile, may accelerate up x2 depending on the type of GPU. It makes sense to compile only the model that is frequently used such as the "transformer" model in the case of video or image generation. Compilation requires Triton to be installed. Triton is available out of the box on Linux or WSL but requires to be installed with Windows: https://github.com/woct0rdho/triton-windows