Memory management · CUDA.jl

A crucial aspect of working with a GPU is managing the data on it. The CuArray type is the primary interface for doing so: Creating a CuArray will allocate data on the GPU, copying elements to it will upload, and converting back to an Array will download values to the CPU:

# generate some data on the CPU
cpu = rand(Float32, 1024)

# allocate on the GPU
gpu = CuArray{Float32}(undef, 1024)

# copy from the CPU to the GPU
copyto!(gpu, cpu)

# download and verify
@test cpu == Array(gpu)

A shorter way to accomplish these operations is to call the copy constructor, i.e. CuArray(cpu).
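
For instance, the allocate-and-copy sequence from the previous example collapses to:

# allocate on the GPU and upload in one step
gpu = CuArray(cpu)
@test cpu == Array(gpu)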

Type-preserving upload

In many cases, you might not want to convert your input data to a dense CuArray. For example, with array wrappers you will want to preserve that wrapper type on the GPU and only upload the contained data. The Adapt.jl package does exactly that, and contains a list of rules on how to unpack and reconstruct types like array wrappers so that we can preserve the type when, e.g., uploading data to the GPU:

julia> cpu = Diagonal([1,2])  # wrapped data on the CPU
2×2 Diagonal{Int64,Array{Int64,1}}:
 1  ⋅
 ⋅  2

julia> using Adapt

julia> gpu = adapt(CuArray, cpu)  # upload to the GPU, keeping the wrapper intact
2×2 Diagonal{Int64,CuArray{Int64,1,Nothing}}:
 1  ⋅
 ⋅  2

Since this is a very common operation, the cu function conveniently does this for you:

julia> cu(cpu)
2×2 Diagonal{Float32,CuArray{Float32,1,Nothing}}:
 1.0   ⋅
  ⋅   2.0
Warning

The cu function is opinionated and converts most floating-point scalars in its input to Float32. This is often a good call, as Float64 and many other scalar types perform badly on the GPU. If this is unwanted, use adapt directly.
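
To illustrate the difference, compare the element types produced by the two approaches (a small sketch; the exact array types vary between CUDA.jl versions, so only the element types are shown):

using Adapt, CUDA

x = rand(Float64, 4)

eltype(cu(x))              # Float32: cu demotes Float64 elements
eltype(adapt(CuArray, x))  # Float64: adapt preserves the element type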

Unified memory

The CuArray constructor and the cu function default to allocating device memory, which can be accessed only from the GPU. It is also possible to allocate unified memory, which is accessible from both the CPU and GPU with the driver taking care of data movement:

julia> cpu = [1,2]
2-element Vector{Int64}:
 1
 2

julia> gpu = CuVector{Int,CUDA.UnifiedMemory}(cpu)
2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:
 1
 2

julia> gpu = cu(cpu; unified=true)
2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:
 1
 2

Using unified memory has several advantages: it is possible to allocate more memory than the GPU has available, and the memory can be accessed efficiently from the CPU, either directly or by wrapping the CuArray in an Array:

julia> gpu[1]  # no scalar indexing error!
1

julia> cpu_again = unsafe_wrap(Array, gpu)
2-element Vector{Int64}:
 1
 2

This can make it significantly easier to port code to the GPU, as you can port parts of your application incrementally without having to worry about accidentally executing CPU code or triggering an AbstractArray fallback. It comes at a cost, however: unified memory needs to be paged in and out of GPU memory, and cannot be allocated asynchronously. To alleviate this cost, CUDA.jl automatically prefetches unified memory when passing it to a kernel.

On recent systems (CUDA 12.2 with the open-source NVIDIA driver), it is also possible to do the reverse and access CPU memory from the GPU, without having to explicitly allocate unified memory through the CuArray constructor or the cu function:

julia> cpu = [1,2];

julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:
 1
 2

julia> gpu .+= 1;

julia> cpu
2-element Vector{Int64}:
 2
 3

Right now, CUDA.jl still defaults to allocating device memory, but this may change in the future. If you want to change the default behavior, you can set the default_memory preference to unified or host instead of device.
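
One way to set this preference is through Preferences.jl, which records it in your project's LocalPreferences.toml (a sketch of the standard preferences workflow; a restart of Julia is required for the change to take effect):

using Preferences, CUDA

# persist the preference in LocalPreferences.toml; restart Julia afterwards
set_preferences!(CUDA, "default_memory" => "unified"; force=true)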

Garbage collection

Instances of the CuArray type are managed by the Julia garbage collector. This means they will be collected once they become unreachable, and the memory they hold will be repurposed or freed. There is no need for manual memory management; just make sure your objects are no longer reachable (i.e., there are no remaining instances or references).
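
For example, a temporary GPU array that only exists inside a function becomes unreachable as soon as the function returns (a minimal sketch; compute is a hypothetical helper):

using CUDA

function compute()
    tmp = CUDA.rand(1024)     # temporary GPU buffer
    result = Array(tmp .* 2)  # download the values we care about
    return result             # `tmp` is unreachable from here on,
end                           # so the GC is free to collect it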

Memory pool

Behind the scenes, a memory pool holds on to the memory backing your arrays and caches it to speed up future allocations. As a result, your GPU might seem to be running out of memory while it actually isn't. When memory pressure is high, the pool will automatically free cached memory:

julia> CUDA.pool_status()  # initial state
Effective GPU memory usage: 16.12% (2.537 GiB/15.744 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

julia> a = CuArray{Int}(undef, 1024);  # allocate 8KB

julia> CUDA.pool_status()
Effective GPU memory usage: 16.35% (2.575 GiB/15.744 GiB)
Memory pool usage: 8.000 KiB (32.000 MiB reserved)

julia> a = nothing; GC.gc(true)

julia> CUDA.pool_status()  # 8KB is now cached
Effective GPU memory usage: 16.34% (2.573 GiB/15.744 GiB)
Memory pool usage: 0 bytes (32.000 MiB reserved)

If for some reason you need all cached memory to be reclaimed, call CUDA.reclaim():

julia> CUDA.reclaim()

julia> CUDA.pool_status()
Effective GPU memory usage: 16.17% (2.546 GiB/15.744 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)
Note

It should never be required to manually reclaim memory before performing any high-level GPU array operation: functionality that allocates should itself call into the memory pool and free cached memory when necessary. If an operation only succeeds after manually reclaiming memory beforehand, that is a bug.

Note

If you need to disable the memory pool, e.g. because of incompatibility with certain CUDA APIs, set the environment variable JULIA_CUDA_MEMORY_POOL to none before importing CUDA.jl.
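
Since the variable is consulted when the package loads, set it before the first import, for example:

# disable the memory pool; must be set before CUDA.jl is first imported
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA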

Memory limits

If you're sharing a GPU with other users or applications, you might want to limit how much memory is used. By default, CUDA.jl will configure the memory pool to use all available device memory. You can change this using two environment variables:

  • JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool. If you set this to a nonzero value, the memory pool will attempt to release cached memory until memory use falls below this limit. Note that this only happens at specific synchronization points, so memory use may temporarily exceed this limit. In addition, this limit is incompatible with JULIA_CUDA_MEMORY_POOL=none.
  • JULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. On CUDA versions older than v12.2, enforcing this limit is relatively expensive, so it is recommended to try the soft limit first.

The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.
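
Like the memory pool setting, these variables need to be set before CUDA.jl is loaded. For example (the limit values below are arbitrary placeholders):

# set limits before loading CUDA.jl; the values here are placeholders
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "50%"
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "1.5GiB"

using CUDA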

Avoiding GC pressure

When your application performs a lot of memory operations, the time spent in the GC may increase significantly. This happens more often on the GPU than on the CPU because GPUs tend to have less memory and run out of it more frequently. When that happens, CUDA.jl invokes the Julia garbage collector, which then needs to scan objects to see if any can be freed to reclaim GPU memory.

To avoid having to depend on the Julia GC to free up memory, you can directly inform CUDA.jl when an allocation can be freed (or reused) by calling the unsafe_free! method. Once you've done so, you cannot use that array anymore:

julia> a = CuArray([1])
1-element CuArray{Int64,1,Nothing}:
 1

julia> CUDA.unsafe_free!(a)

julia> a
1-element CuArray{Int64,1,Nothing}:
Error showing value of type CuArray{Int64,1,Nothing}:
ERROR: AssertionError: Use of freed memory

Batching iterator

If you are dealing with data sets that are too large to fit on the GPU all at once, you can use CuIterator to batch operations:

julia> batches = [([1], [2]), ([3], [4])]

julia> for (batch, (a,b)) in enumerate(CuIterator(batches))
           println("Batch $batch: ", a .+ b)
       end
Batch 1: [3]
Batch 2: [7]

For each batch, every argument (assumed to be array-like) is uploaded to the GPU using the adapt mechanism from above. Afterwards, the memory is eagerly returned to the CUDA memory pool using unsafe_free! to lower GC pressure.
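
Conceptually, iterating a CuIterator behaves like the following hand-written loop (a simplified sketch, not the actual implementation; foreach_batch is a hypothetical helper):

using Adapt, CUDA

function foreach_batch(f, batches)
    for batch in batches
        gpu_batch = map(x -> adapt(CuArray, x), batch)  # upload every argument
        f(gpu_batch...)
        foreach(CUDA.unsafe_free!, gpu_batch)  # eagerly return the memory to the pool
    end
end

foreach_batch((a, b) -> println(Array(a .+ b)), [([1], [2]), ([3], [4])])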
