Synchronization Functions

Currently, only several of the SYCL synchronization operations are supported. For synchronization of all threads in the same thread block, a helper function called numba_dpex.barrier() is provided. This function implements the same pattern as barriers in traditional multi-threaded programming: invoking the function forces a thread to wait until all threads in the block reach the barrier, at which point it returns control to all its callers.

numba_dpex.barrier() supports two memory fence options:

  • numba_dpex.CLK_GLOBAL_MEM_FENCE: The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. Using the option can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data. Passing no arguments to numba_dpex.barrier() is equivalent to setting the global memory fence option. For example,

    def no_arg_barrier_support():
        """
        This example demonstrates the usage of numba_dpex's ``barrier``
        intrinsic function. The ``barrier`` function is usable only inside
        a ``kernel`` and is equivalent to OpenCL's ``barrier`` function.
        """
    
        @dppy.kernel
        def twice(A):
            i = dppy.get_global_id(0)
            d = A[i]
            # no argument defaults to global mem fence
            dppy.barrier()
            A[i] = d * 2
    
        N = 10
        arr = np.arange(N).astype(np.float32)
        print(arr)
    
        # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
        # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
        device = dpctl.select_default_device()
        print("Using device ...")
        device.print_device_info()
    
        with dpctl.device_context(device):
            twice[N, dppy.DEFAULT_LOCAL_SIZE](arr)
    
        # the output should be `arr * 2, i.e. [0, 2, 4, 6, ...]`
        print(arr)
    
  • numba_dpex.CLK_LOCAL_MEM_FENCE: The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory. For example,

def local_memory():
    """
    This example demonstrates the usage of numba-dpex's `local.array`
    intrinsic function. The function is used to create a static array
    allocated on the devices local address space.
    """
    blocksize = 10

    @dppy.kernel
    def reverse_array(A):
        lm = dppy.local.array(shape=10, dtype=float32)
        i = dppy.get_global_id(0)

        # preload
        lm[i] = A[i]
        # barrier local or global will both work as we only have one work group
        dppy.barrier(dppy.CLK_LOCAL_MEM_FENCE)  # local mem fence
        # write
        A[i] += lm[blocksize - 1 - i]

    arr = np.arange(blocksize).astype(np.float32)
    print(arr)

    # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
    # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
    device = dpctl.select_default_device()
    print("Using device ...")
    device.print_device_info()

    with dpctl.device_context(device):
        reverse_array[blocksize, dppy.DEFAULT_LOCAL_SIZE](arr)

    # the output should be `orig[::-1] + orig, i.e. [9, 9, 9, ...]``
    print(arr)

Note

The numba_dpex.barrier() function is semantically equivalent to numba.cuda.syncthreads.