Writing Data Parallel Kernels

Kernel Declaration
A kernel function is a device function that is meant to be called from host code, where a device can be any SYCL supported device such as a GPU, CPU, or an FPGA. The main characteristics of a kernel function are:
- Scalars must be passed as arrays. Kernels operate with `dpnp` array arguments only. If you want a scalar argument, represent it as a 0-dimensional `dpnp` array.

  Note: Please refer to Data Parallel Extension for Numpy* to learn more about `dpnp`.

- Kernels cannot explicitly return a value. All result data must be written to a `dpnp` array passed as a function argument.
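
The scalar-as-array rule can be sketched with NumPy, whose array API `dpnp` mirrors (NumPy is used here only so the sketch runs without a SYCL device; with `dpnp` the same `asarray` call would allocate on the device):

```python
import numpy as np  # stand-in for dpnp, which mirrors the NumPy API

# Represent the scalar 2.5 as a 0-dimensional array so it can be
# passed to a kernel that accepts array arguments only.
scale = np.asarray(2.5)
print(scale.ndim)    # 0
print(float(scale))  # 2.5
```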
Here is an example of a kernel that computes the sum of two vectors `a` and `b`. The arguments are two input vectors, `a` and `b`, and one output vector `c` for storing the result of the vector summation:
```python
import numba_dpex as ndpx


# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]
```
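
To see what the kernel computes, its per-work-item semantics can be emulated sequentially in plain Python (a sketch only; on the device the kernel body runs once per work item, potentially in parallel):

```python
# Sequential emulation of kernel_vector_sum: the loop index i plays
# the role of ndpx.get_global_id(0), one iteration per work item.
def emulate_vector_sum(a, b, c):
    for i in range(len(c)):
        c[i] = a[i] + b[i]

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.0] * 3
emulate_vector_sum(a, b, c)
print(c)  # [11.0, 22.0, 33.0]
```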
Kernel Invocation
The kernel launch parameter syntax for specifying global and local sizes is similar to SYCL's `range` and `ndrange` classes. The global and local sizes need to be specified with `numba_dpex`'s `Range` and `NdRange` classes.
For example, below is a kernel that computes the sum of two vectors:
```python
import numba_dpex as ndpx


# Data parallel kernel implementing vector sum
@ndpx.kernel
def kernel_vector_sum(a, b, c):
    i = ndpx.get_global_id(0)
    c[i] = a[i] + b[i]
```
If only the global size parameter is needed, the kernel can be launched like this (where `global_size` is an `int`):
```python
import dpnp
import numba_dpex as ndpx
from numpy import testing


# Utility function for printing and testing
def driver(a, b, c, global_size):
    kernel_vector_sum[ndpx.Range(global_size)](a, b, c)
    a_np = dpnp.asnumpy(a)  # Copy dpnp array a to NumPy array a_np
    b_np = dpnp.asnumpy(b)  # Copy dpnp array b to NumPy array b_np
    c_np = dpnp.asnumpy(c)  # Copy dpnp array c to NumPy array c_np
    testing.assert_equal(c_np, a_np + b_np)
```
If both local and global ranges are needed, they can be specified using two instances of `Range` inside an `NdRange` object. For example, below is a kernel to compute pair-wise Euclidean distances of n-dimensional data points:
```python
from math import sqrt

import numba_dpex as ndpx


@ndpx.kernel
def pairwise_distance(X, D, xshape0, xshape1):
    """
    A Euclidean pairwise distance computation implemented as
    a ``kernel`` function.
    """
    idx = ndpx.get_global_id(0)
    # A zero of the same dtype as the data in X
    d0 = X[idx, 0] - X[idx, 0]
    for j in range(X.shape[0]):
        d = d0
        for k in range(X.shape[1]):
            tmp = X[idx, k] - X[j, k]
            d += tmp * tmp
        D[idx, j] = sqrt(d)
```
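
As a sanity check, the same computation can be emulated sequentially in plain Python (a sketch; each outer-loop iteration below corresponds to one work item filling row `idx` of the distance matrix):

```python
from math import sqrt

# Sequential emulation of the pairwise_distance kernel.
def emulate_pairwise_distance(X):
    n = len(X)
    dim = len(X[0])
    D = [[0.0] * n for _ in range(n)]
    for idx in range(n):  # idx plays the role of ndpx.get_global_id(0)
        for j in range(n):
            d = 0.0
            for k in range(dim):
                tmp = X[idx][k] - X[j][k]
                d += tmp * tmp
            D[idx][j] = sqrt(d)
    return D

# Two points, (0, 0) and (3, 4); their Euclidean distance is 5.
D = emulate_pairwise_distance([[0.0, 0.0], [3.0, 4.0]])
print(D[0][1])  # 5.0
```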
Now the local and global sizes can be specified as follows (here both `args.n` and `args.l` are `int`):
```python
from time import time

import numba_dpex as ndpx

# Global work size is equal to the number of points
global_size = ndpx.Range(args.n)
# Local work size is optional
local_size = ndpx.Range(args.l)


# Measure the kernel's running time over args.r repetitions
def driver():
    times = list()
    for repeat in range(args.r):
        start = time()
        pairwise_distance[ndpx.NdRange(global_size, local_size)](
            X, D, X.shape[0], X.shape[1]
        )
        end = time()
        total_time = end - start
        times.append(total_time)
    return times
```
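
Note that in SYCL's nd-range execution model the global size in each dimension is expected to be a multiple of the corresponding local size (a constraint inherited from SYCL; the sizes below are illustrative, not from the example above):

```python
# Sanity check on nd-range launch parameters: the global size must
# divide evenly into work groups of the local size.
global_size = 1024
local_size = 64
assert global_size % local_size == 0
num_groups = global_size // local_size
print(num_groups)  # 16
```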
Kernel Indexing Functions
In data parallel kernel programming, all work items are enumerated and accessed by their index. Use the `numba_dpex.get_global_id()` function to get the index of the current work item from inside the kernel. The total number of work items can be determined by calling the `numba_dpex.get_global_size()` function.

The work-group size can be determined by calling the `numba_dpex.get_local_size()` function. The index of a work item within its work group is obtained by calling `numba_dpex.get_local_id()`.

The total number of work groups is determined by calling the `numba_dpex.get_num_groups()` function. The current work-group index is obtained by calling the `numba_dpex.get_group_id()` function.