Memory Management
DPC++’s USM shared memory allocator (memory_alloc
) is used to
enable host to device and vice versa data transfer. By using USM shared
memory allocator, it allows seamless interoperability between
numba-dpex
and other SYCL-based Python extensions and across multiple
kernels written using numba_dpex.kernel
decorator.
USM memory manager provided by dpctl
and support for
the SYCL USM Array Interface protocol enable zero-copy data
exchange across USM memory-backed Python objects.
Note
USM pointers make sense within a SYCL context and can be of four allocation
types: host
, device
, shared
, or unknown
. Host applications,
including CPython interpreter, can work with USM pointers of type host
and shared
as if they were ordinary host pointers. Accessing device
USM pointers by host applications is not allowed.
SYCL USM Array Interface
A SYCL library may allocate USM memory for the result that needs to be passed to
Python. A native Python extension that makes use of such a library may expose
this memory as an instance of Python class that will implement memory management
logic (ensures that memory is freed when the instance is no longer needed).
The need to manage memory arises whenever a library uses a custom allocator.
For example, daal4py
uses Python capsule to ensure that a native
library-allocated memory is freed using the appropriate deallocator.
To enable native extensions to pass the memory allocated by a native SYCL
library to Numba, or another SYCL-aware Python extension without making a copy,
the class must provide __sycl_usm_array_interface__
attribute which
returns a Python dictionary with the following fields:
shape
: tuple ofint
typestr
:string
typedescr
: a list of tuplesdata
: (int
,bool
)strides
: tuple ofint
offset
:int
version
:int
syclobj
:dpctl.SyclQueue
ordpctl.SyclContext
object
The dictionary keys align with those of numpy.ndarray.__array_interface__
and __cuda_array_interface__
. For host accessible
USM pointers, the object may also implement CPython
PEP-3118
compliant buffer interface which will be used if a data
key is not present
in the dictionary. Use of a buffer interface extends the interoperability to
other Python objects, such as bytes
, bytearray
, array.array
, or
memoryview
. The type of the USM pointer stored in the object can be queried
using methods of the dpctl
.
Device-only memory and explicit data transfer
At the moment, there is no mechanism for the explicit transfer of arrays to the device and back. Please use usm arrays.
Local memory
In SYCL’s memory model, local memory is a contiguous region of memory allocated per work group and is visible to all the work items in that group. Local memory is device-only and cannot be accessed from the host. From the perspective offers the device, the local memory is exposed as a contiguous array of a specific types. The maximum available local memory is hardware-specific. The SYCL local memory concept is analogous to CUDA’s shared memory concept.
A special function dppy.local.array
is provided to
allocate local memory for a kernel.
def local_memory():
"""
This example demonstrates the usage of numba-dpex's `local.array`
intrinsic function. The function is used to create a static array
allocated on the devices local address space.
"""
blocksize = 10
@dppy.kernel
def reverse_array(A):
lm = dppy.local.array(shape=10, dtype=float32)
i = dppy.get_global_id(0)
# preload
lm[i] = A[i]
# barrier local or global will both work as we only have one work group
dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE) # local mem fence
# write
A[i] += lm[blocksize - 1 - i]
arr = np.arange(blocksize).astype(np.float32)
print(arr)
# Use the environment variable SYCL_DEVICE_FILTER to change the default device.
# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
device = dpctl.select_default_device()
print("Using device ...")
device.print_device_info()
with dpctl.device_context(device):
reverse_array[blocksize, dppy.DEFAULT_LOCAL_SIZE](arr)
# the output should be `orig[::-1] + orig, i.e. [9, 9, 9, ...]``
print(arr)
Note
To go convert from numba.cuda
to numba-dpex
, replace
numba.cuda.shared.array
with
numba_dpex.local.array(shape=blocksize, dtype=float32)
.
Todo
Add details about current limitations for local memory allocation.
Private and Constant memory
SYCL private and constant memory are not supported yet.