Writing Data Parallel Kernels#

Kernel Declaration#

A kernel function is a device function that is meant to be called from host code, where a device can be any SYCL supported device such as a GPU, CPU, or an FPGA. The main characteristics of a kernel function are:

  • Scalars must be passed as an array. Kernels operate with dpnp array arguments only. If your want a scalar argument, then represent it as 0-dimensional dpnp array.

Note

Please refer to Data Parallel Extension for Numpy* to learn more about dpnp.

  • Kernels cannot explicitly return a value. All result data must be written to dpnp array passed as a function’s argument.

Here is an example of a kernel that computes sum of two vectors a and b. Arguments are two input vectors a and b and one output vector c for storing the result of vector summation:

EXAMPLE: Data parallel kernel implementing the vector sum a+b#
import numba_dpex as ndpx

# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]

Kernel Invocation#

The kernel launch parameter syntax for specifying global and local sizes are similar to SYCL’s range and ndrange classes. The global and local sizes need to be specified with numba_dpex’s Range and NdRange classes.

For example, below is a following kernel that computes a sum of two vectors:

EXAMPLE: A vector sum kernel#
import numba_dpex as ndpx

# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]

If the global size parameter is needed to run, it could be like this (where global_size is an int):

EXAMPLE: A vector sum kernel with a global size/range#
import numba_dpex as ndpx

# Utility function for printing and testing
def driver(a, b, c, global_size):
    kernel_vector_sum[ndpx.Range(global_size)](a, b, c)
    a_np = dpnp.asnumpy(a)  # Copy dpnp array a to NumPy array a_np
    b_np = dpnp.asnumpy(b)  # Copy dpnp array b to NumPy array b_np
    c_np = dpnp.asnumpy(c)  # Copy dpnp array c to NumPy array c_np
    testing.assert_equal(c_np, a_np + b_np)

If both local and global ranges are needed, they can be specified using two instances of Range inside an NdRange object. For example, below is a kernel to compute pair-wise Euclidean distances of n-dimensional data points:

EXAMPLE: A kernel to compute pair-wise Euclidean distances#
import numba_dpex as ndpx

@ndpx.kernel
def pairwise_distance(X, D, xshape0, xshape1):
    """
    An Euclidean pairwise distance computation implemented as
    a ``kernel`` function.
    """
    idx = ndpx.get_global_id(0)

    d0 = X[idx, 0] - X[idx, 0]
    # for i in range(xshape0):
    for j in range(X.shape[0]):
        d = d0
        for k in range(X.shape[1]):
            tmp = X[idx, k] - X[j, k]
            d += tmp * tmp
        D[idx, j] = sqrt(d)

Now the local and global sizes can be specified as follows (here both args.n and args.l are int):

EXAMPLE: A kernel to compute pair-wise Euclidean distances with a global and a local size/range#
import numba_dpex as ndpx

# Global work size is equal to the number of points
global_size = ndpx.Range(args.n)
# Local Work size is optional
local_size = ndpx.Range(args.l)

def driver():
    # measure running time
    times = list()
    for repeat in range(args.r):
        start = time()
        pairwise_distance[ndpx.NdRange(global_size, local_size)](
            X, D, X.shape[0], X.shape[1]
        )
        end = time()

        total_time = end - start
        times.append(total_time)

    return times

Kernel Indexing Functions#

In data parallel kernel programming all work items are enumerated and accessed by their index. You will use numba_dpex.get_global_id() function to get the index of a current work item from the kernel. The total number of work items can be determined by calling numba_dpex.get_global_size() function.

The work group size can be determined by calling numba_dpex.get_local_size() function. Work items in the current work group are accessed by calling numba_dpex.get_local_id().

The total number of work groups are determined by calling numba_dpex.get_num_groups() function. The current work group index is obtained by calling numba_dpex.get_group_id() function.