Synchronization Functions
Currently, only several of the SYCL synchronization operations are supported. For synchronization of all threads in the same thread block, a helper function called numba_dpex.barrier()
is provided. This function implements the same pattern as barriers in traditional
multi-threaded programming: invoking the function forces a thread to wait until
all threads in the block reach the barrier, at which point it returns control
to all its callers.
numba_dpex.barrier()
supports two memory fence options:
numba_dpex.CLK_GLOBAL_MEM_FENCE
: The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. Using the option can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data. Passing no arguments tonumba_dpex.barrier()
is equivalent to setting the global memory fence option. For example,def no_arg_barrier_support(): """ This example demonstrates the usage of numba_dpex's ``barrier`` intrinsic function. The ``barrier`` function is usable only inside a ``kernel`` and is equivalent to OpenCL's ``barrier`` function. """ @dppy.kernel def twice(A): i = dppy.get_global_id(0) d = A[i] # no argument defaults to global mem fence dppy.barrier() A[i] = d * 2 N = 10 arr = np.arange(N).astype(np.float32) print(arr) # Use the environment variable SYCL_DEVICE_FILTER to change the default device. # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter. device = dpctl.select_default_device() print("Using device ...") device.print_device_info() with dpctl.device_context(device): twice[N, dppy.DEFAULT_LOCAL_SIZE](arr) # the output should be `arr * 2, i.e. [0, 2, 4, 6, ...]` print(arr)
numba_dpex.CLK_LOCAL_MEM_FENCE
: The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory. For example,
def local_memory():
"""
This example demonstrates the usage of numba-dpex's `local.array`
intrinsic function. The function is used to create a static array
allocated on the devices local address space.
"""
blocksize = 10
@dppy.kernel
def reverse_array(A):
lm = dppy.local.array(shape=10, dtype=float32)
i = dppy.get_global_id(0)
# preload
lm[i] = A[i]
# barrier local or global will both work as we only have one work group
dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE) # local mem fence
# write
A[i] += lm[blocksize - 1 - i]
arr = np.arange(blocksize).astype(np.float32)
print(arr)
# Use the environment variable SYCL_DEVICE_FILTER to change the default device.
# See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
device = dpctl.select_default_device()
print("Using device ...")
device.print_device_info()
with dpctl.device_context(device):
reverse_array[blocksize, dppy.DEFAULT_LOCAL_SIZE](arr)
# the output should be `orig[::-1] + orig, i.e. [9, 9, 9, ...]``
print(arr)
Note
The numba_dpex.barrier()
function is semantically equivalent to
numba.cuda.syncthreads
.