Writing Data Parallel Kernels#
Kernel Declaration#
A kernel function is a device function that is meant to be called from host code, where a device can be any SYCL-supported device such as a GPU, CPU, or FPGA. The main characteristics of a kernel function are:
Scalars must be passed as an array. Kernels operate with dpnp array arguments only. If you want a scalar argument, represent it as a 0-dimensional dpnp array, as sketched below.
Note
Please refer to Data Parallel Extension for Numpy* to learn more about dpnp.
Kernels cannot explicitly return a value. All result data must be written to a dpnp array passed as a function argument.
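For instance, a scalar scale factor can be wrapped on the host before the kernel launch. This is only a minimal sketch; the variable name and value are illustrative:
import dpnp

# Kernels accept only dpnp arrays, so wrap the Python scalar first.
scale = dpnp.asarray(0.5)
print(scale.ndim, scale.shape)  # prints: 0 ()
The resulting 0-dimensional array is then passed to the kernel alongside the regular array arguments.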
Here is an example of a kernel that computes the sum of two vectors. The arguments are two input vectors a and b and one output vector c that stores the result of the vector summation:
import numba_dpex as ndpx

# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]
Kernel Invocation#
The kernel launch parameter syntax for specifying global and local sizes is similar to SYCL’s range and nd_range classes. The global and local sizes need to be specified with numba_dpex’s Range and NdRange classes.
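As a sketch of the two launch forms (the kernel, array names, and sizes below are placeholders chosen for illustration), passing a Range in the square brackets sets only the global size, while passing an NdRange built from two Range instances sets both the global and the local size:
import dpnp
import numba_dpex as ndpx

@ndpx.kernel
def copy_kernel(src, dst):
    i = ndpx.get_global_id(0)
    dst[i] = src[i]

src = dpnp.arange(1024, dtype=dpnp.float32)
dst = dpnp.empty_like(src)

# Global size only: 1024 work items in one dimension.
copy_kernel[ndpx.Range(src.size)](src, dst)

# Global and local sizes: 1024 work items split into work groups of 64.
copy_kernel[ndpx.NdRange(ndpx.Range(src.size), ndpx.Range(64))](src, dst)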
For example, below is a kernel that computes the sum of two vectors:
import numba_dpex as ndpx

# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]
If only the global size is needed to launch the kernel, the invocation can look like this (where global_size is an int):
import dpnp
import numba_dpex as ndpx
from numpy import testing

# Utility function for printing and testing
def driver(a, b, c, global_size):
    kernel_vector_sum[ndpx.Range(global_size)](a, b, c)
    a_np = dpnp.asnumpy(a)  # Copy dpnp array a to NumPy array a_np
    b_np = dpnp.asnumpy(b)  # Copy dpnp array b to NumPy array b_np
    c_np = dpnp.asnumpy(c)  # Copy dpnp array c to NumPy array c_np
    testing.assert_equal(c_np, a_np + b_np)
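One possible way to exercise this driver (the vector length and dtype here are arbitrary choices, not part of the original example) is to allocate the three vectors with dpnp and pass their common length as the global size:
import dpnp

N = 1024
a = dpnp.arange(N, dtype=dpnp.float32)
b = dpnp.ones(N, dtype=dpnp.float32)
c = dpnp.zeros(N, dtype=dpnp.float32)

driver(a, b, c, N)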
If both local and global ranges are needed, they can be specified using two
instances of Range inside an NdRange object. For example, below is a
kernel to compute pair-wise Euclidean distances of n-dimensional data points:
from math import sqrt

import numba_dpex as ndpx

@ndpx.kernel
def pairwise_distance(X, D, xshape0, xshape1):
    """
    An Euclidean pairwise distance computation implemented as
    a ``kernel`` function.
    """
    idx = ndpx.get_global_id(0)

    # A zero of the same dtype as X, used to initialize the accumulator
    d0 = X[idx, 0] - X[idx, 0]
    # for i in range(xshape0):
    for j in range(X.shape[0]):
        d = d0
        for k in range(X.shape[1]):
            tmp = X[idx, k] - X[j, k]
            d += tmp * tmp
        D[idx, j] = sqrt(d)
The local and global sizes can now be specified as follows (where both args.n and args.l are ints):
from time import time

import numba_dpex as ndpx

# Global work size is equal to the number of points
global_size = ndpx.Range(args.n)
# Local work size is optional
local_size = ndpx.Range(args.l)

def driver():
    # Measure the running time over args.r repetitions
    times = list()
    for repeat in range(args.r):
        start = time()
        pairwise_distance[ndpx.NdRange(global_size, local_size)](
            X, D, X.shape[0], X.shape[1]
        )
        end = time()
        total_time = end - start
        times.append(total_time)
    return times
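The snippet above assumes that X, D, and args are already defined in the enclosing scope. A hedged sketch of that setup (the _Args class and the attribute values are illustrative assumptions, not part of the original example) could look like:
import dpnp

class _Args:
    n = 1024  # number of points (global size)
    l = 64    # work group size (local size)
    r = 5     # number of timing repetitions
    d = 3     # dimensions per point (illustrative)

args = _Args()
X = dpnp.random.rand(args.n, args.d)  # n points in d dimensions
D = dpnp.zeros((args.n, args.n))      # output distance matrix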
Kernel Indexing Functions#
In data parallel kernel programming, all work items are enumerated and accessed by their index. Use the numba_dpex.get_global_id() function to get the index of the current work item from inside the kernel. The total number of work items can be determined by calling the numba_dpex.get_global_size() function.
The work group size can be determined by calling the numba_dpex.get_local_size() function, and the index of the current work item within its work group is obtained by calling numba_dpex.get_local_id().
The total number of work groups is determined by calling the numba_dpex.get_num_groups() function, and the index of the current work group is obtained by calling the numba_dpex.get_group_id() function.
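As a sketch that ties these functions together (the kernel name, array, and launch sizes are illustrative, and the snippet assumes the indexing functions take a dimension argument just like get_global_id), the following kernel recomputes each work item’s global index from its group index, group size, and local index:
import dpnp
import numba_dpex as ndpx

@ndpx.kernel
def index_info(out):
    gid = ndpx.get_global_id(0)     # global index of this work item
    lid = ndpx.get_local_id(0)      # index within the work group
    lsize = ndpx.get_local_size(0)  # number of work items per group
    grp = ndpx.get_group_id(0)      # index of this work group
    # For a 1-D launch: group_id * local_size + local_id == global_id.
    out[gid] = grp * lsize + lid

out = dpnp.zeros(256, dtype=dpnp.int64)
index_info[ndpx.NdRange(ndpx.Range(256), ndpx.Range(64))](out)
After the launch, out should hold the values 0 through 255 in order, which confirms the relationship between the group, local, and global indices.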