In a range kernel, the kernel execution is scheduled over a set of work-items without any explicit grouping of the work-items. The basic form of parallelism that can be expressed using a range kernel does not allow expressing any notion of locality within the kernel. To get around that limitation, kapi provides a second form of expressing a parallel kernel that is called an nd-range kernel. An nd-range kernel represents a data-parallel execution of the kernel by a set of explicitly defined groups of work-items. An individual group of work-items is called a work-group. Example: Sliding window matrix multiplication as an nd-range kernel demonstrates an nd-range kernel and some of the advanced features programmers can use in this type of kernel.
1from numba_dpex import kernel_api as kapi
2import numba_dpex as dpex
3import numpy as np
4import dpctl.tensor as dpt
5
6square_block_side = 2
7work_group_size = (square_block_side, square_block_side)
8dtype = np.float32
9
10
11@dpex.kernel
12def matmul(
13 nditem: kapi.NdItem,
14 X, # IN READ-ONLY (X_n_rows, n_cols)
15 y, # IN READ-ONLY (n_cols, y_n_rows),
16 X_slm, # SLM to store a sliding window over X
17 Y_slm, # SLM to store a sliding window over Y
18 result, # OUT (X_n_rows, y_n_rows)
19):
20 X_n_rows = X.shape[0]
21 Y_n_cols = y.shape[1]
22 n_cols = X.shape[1]
23
24 result_row_idx = nditem.get_global_id(0)
25 result_col_idx = nditem.get_global_id(1)
26
27 local_row_idx = nditem.get_local_id(0)
28 local_col_idx = nditem.get_local_id(1)
29
30 n_blocks_for_cols = n_cols // square_block_side
31 if (n_cols % square_block_side) > 0:
32 n_blocks_for_cols += 1
33
34 output = dtype(0)
35
36 gr = nditem.get_group()
37
38 for block_idx in range(n_blocks_for_cols):
39 X_slm[local_row_idx, local_col_idx] = dtype(0)
40 Y_slm[local_row_idx, local_col_idx] = dtype(0)
41 if (result_row_idx < X_n_rows) and (
42 (local_col_idx + (square_block_side * block_idx)) < n_cols
43 ):
44 X_slm[local_row_idx, local_col_idx] = X[
45 result_row_idx, local_col_idx + (square_block_side * block_idx)
46 ]
47
48 if (result_col_idx < Y_n_cols) and (
49 (local_row_idx + (square_block_side * block_idx)) < n_cols
50 ):
51 Y_slm[local_row_idx, local_col_idx] = y[
52 local_row_idx + (square_block_side * block_idx), result_col_idx
53 ]
54
55 kapi.group_barrier(gr)
56
57 for idx in range(square_block_side):
58 output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]
59
60 kapi.group_barrier(gr)
61
62 if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
63 result[result_row_idx, result_col_idx] = output
64
65
66def _arange_reshaped(shape, dtype):
67 n_items = shape[0] * shape[1]
68 return np.arange(n_items, dtype=dtype).reshape(shape)
69
70
71X = _arange_reshaped((5, 5), dtype)
72Y = _arange_reshaped((5, 5), dtype)
73X = dpt.asarray(X)
74Y = dpt.asarray(Y)
75device = X.device.sycl_device
76result = dpt.zeros((5, 5), dtype, device=device)
77X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
78Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
79
80dpex.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
When writing an nd-range kernel, a programmer defines a set of groups of work-items instead of a flat execution range.There are several semantic rules associated both with a work-group and the work-items in a work-group:
Each work-group gets executed in an arbitrary order by the underlying runtime and programmers should not assume any implicit ordering.
Work-items in different wok-groups cannot communicate with each other except via atomic operations on global memory.
Work-items within a work-group share a common memory region called “shared local memory” (SLM). Depending on the device the SLM maybe mapped to a dedicated fast memory.
Work-items in a work-group can synchronize using a
numba_dpex.kernel_api.group_barrier()
operation that can additionally guarantee memory consistency using a work-group memory fence.
Note
The SYCL language provides additional features for work-items in a work-group such as group functions that specify communication routines across work-items and also implement patterns such as reduction and scan. These features are not yet available in numba-dpex.
An nd-range kernel needs to be launched with an instance of the
numba_dpex.kernel_api.NdRange
class and the first
argument to an nd-range kernel has to be an instance of
numba_dpex.kernel_api.NdItem
. Apart from the need to provide an
`NdItem
parameter, the rest of the semantic rules that apply to a range
kernel also apply to an nd-range kernel.