In a range kernel, the kernel execution is scheduled over a set of work-items without any explicit grouping of the work-items. This basic form of parallelism cannot express any notion of locality within the kernel. To get around that limitation, kapi provides a second way of expressing a parallel kernel, called an nd-range kernel. An nd-range kernel represents a data-parallel execution of the kernel by a set of explicitly defined groups of work-items; an individual group of work-items is called a work-group. Example: Sliding window matrix multiplication as an nd-range kernel demonstrates an nd-range kernel and some of the advanced features programmers can use in this type of kernel.

Example: Sliding window matrix multiplication as an nd-range kernel
from numba_dpex import kernel_api as kapi
import numba_dpex as dpex
import numpy as np
import dpctl.tensor as dpt

square_block_side = 2
work_group_size = (square_block_side, square_block_side)
dtype = np.float32


@dpex.kernel
def matmul(
    nditem: kapi.NdItem,
    X,  # IN READ-ONLY    (X_n_rows, n_cols)
    y,  # IN READ-ONLY    (n_cols, Y_n_cols)
    X_slm,  # SLM to store a sliding window over X
    Y_slm,  # SLM to store a sliding window over Y
    result,  # OUT        (X_n_rows, Y_n_cols)
):
    X_n_rows = X.shape[0]
    Y_n_cols = y.shape[1]
    n_cols = X.shape[1]

    result_row_idx = nditem.get_global_id(0)
    result_col_idx = nditem.get_global_id(1)

    local_row_idx = nditem.get_local_id(0)
    local_col_idx = nditem.get_local_id(1)

    n_blocks_for_cols = n_cols // square_block_side
    if (n_cols % square_block_side) > 0:
        n_blocks_for_cols += 1

    output = dtype(0)

    gr = nditem.get_group()

    for block_idx in range(n_blocks_for_cols):
        X_slm[local_row_idx, local_col_idx] = dtype(0)
        Y_slm[local_row_idx, local_col_idx] = dtype(0)
        if (result_row_idx < X_n_rows) and (
            (local_col_idx + (square_block_side * block_idx)) < n_cols
        ):
            X_slm[local_row_idx, local_col_idx] = X[
                result_row_idx, local_col_idx + (square_block_side * block_idx)
            ]

        if (result_col_idx < Y_n_cols) and (
            (local_row_idx + (square_block_side * block_idx)) < n_cols
        ):
            Y_slm[local_row_idx, local_col_idx] = y[
                local_row_idx + (square_block_side * block_idx), result_col_idx
            ]

        kapi.group_barrier(gr)

        for idx in range(square_block_side):
            output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx]

        kapi.group_barrier(gr)

    if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols):
        result[result_row_idx, result_col_idx] = output


def _arange_reshaped(shape, dtype):
    n_items = shape[0] * shape[1]
    return np.arange(n_items, dtype=dtype).reshape(shape)


X = _arange_reshaped((5, 5), dtype)
Y = _arange_reshaped((5, 5), dtype)
X = dpt.asarray(X)
Y = dpt.asarray(Y)
device = X.device.sycl_device
result = dpt.zeros((5, 5), dtype, device=device)
X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)
Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype)

dpex.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result)
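To see what the sliding-window loop computes, the same blocking scheme can be emulated on the host with plain NumPy. The sketch below is only an illustration of the algorithm, not part of kapi; `tiled_matmul_host` is a hypothetical helper that accumulates one square block of columns of X (and rows of Y) at a time, just as each work-group does through `X_slm` and `Y_slm`.

```python
import numpy as np


def tiled_matmul_host(X, Y, block=2):
    # Host-side emulation of the kernel's sliding-window algorithm:
    # accumulate partial products one square block at a time. The slices
    # X[:, lo:hi] and Y[lo:hi, :] play the role of X_slm and Y_slm.
    n_rows, n_cols = X.shape
    _, y_cols = Y.shape
    result = np.zeros((n_rows, y_cols), dtype=X.dtype)
    # Same rounding-up block count as n_blocks_for_cols in the kernel.
    n_blocks = (n_cols + block - 1) // block
    for b in range(n_blocks):
        lo = b * block
        hi = min(lo + block, n_cols)
        result += X[:, lo:hi] @ Y[lo:hi, :]
    return result


X = np.arange(25, dtype=np.float32).reshape(5, 5)
Y = np.arange(25, dtype=np.float32).reshape(5, 5)
# Block-wise accumulation matches an ordinary matrix multiplication.
assert np.allclose(tiled_matmul_host(X, Y), X @ Y)
```

The blocked accumulation produces the same values as a plain matrix product; the point of the blocking in the kernel is that each window is staged in fast shared local memory before being consumed.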

When writing an nd-range kernel, a programmer defines a set of groups of work-items instead of a flat execution range. There are several semantic rules associated both with a work-group and the work-items in a work-group:

  • Work-groups are executed in an arbitrary order by the underlying runtime, and programmers should not assume any implicit ordering.

  • Work-items in different work-groups cannot communicate with each other, except via atomic operations on global memory.

  • Work-items within a work-group share a common memory region called “shared local memory” (SLM). Depending on the device, the SLM may be mapped to dedicated fast memory.

  • Work-items in a work-group can synchronize using a numba_dpex.kernel_api.group_barrier() operation, which can additionally guarantee memory consistency via a work-group memory fence.
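The index relationships implied by these rules can be sketched in plain Python. This is only an illustration; the arithmetic mirrors how a global index decomposes into a group index and a local index, but `decompose` and the range tuples are not part of the kapi API.

```python
# Pure-Python illustration of nd-range index arithmetic (not a kapi API).
# With a global range of (6, 6) split into work-groups of (2, 2), each
# work-item's global id decomposes into a group id and a local id.
global_range = (6, 6)
local_range = (2, 2)


def decompose(global_id, dim):
    # group id: which work-group along this dimension
    # local id: the work-item's position inside that work-group
    return global_id // local_range[dim], global_id % local_range[dim]


# The work-item with global id 5 along dimension 0 is local item 1
# of work-group 2 in that dimension.
assert decompose(5, 0) == (2, 1)
# Conversely, global id = group id * work-group size + local id.
assert 2 * local_range[0] + 1 == 5
```

In the matrix multiplication example, `get_global_id` selects the output element a work-item owns, while `get_local_id` selects its slot in the shared `X_slm`/`Y_slm` windows.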

Note

The SYCL language provides additional features for work-items in a work-group such as group functions that specify communication routines across work-items and also implement patterns such as reduction and scan. These features are not yet available in numba-dpex.

An nd-range kernel needs to be launched with an instance of the numba_dpex.kernel_api.NdRange class, and the first argument to an nd-range kernel has to be an instance of numba_dpex.kernel_api.NdItem. Apart from the need to provide an NdItem parameter, the rest of the semantic rules that apply to a range kernel also apply to an nd-range kernel.
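As a quick sanity check on a launch configuration, the work-group grid implied by an NdRange can be computed by hand. The sketch below uses plain Python tuples to mirror the kapi.NdRange((6, 6), (2, 2)) launch from the example; the variable names are illustrative, not kapi APIs.

```python
# For the example launch kapi.NdRange((6, 6), (2, 2)):
global_range = (6, 6)  # total work-items per dimension
local_range = (2, 2)   # work-items per work-group per dimension

# The global range should be divisible by the local range in every
# dimension so that work-groups tile the execution range exactly.
assert all(g % l == 0 for g, l in zip(global_range, local_range))

# Number of work-groups per dimension, and in total.
groups = tuple(g // l for g, l in zip(global_range, local_range))
assert groups == (3, 3)            # a 3 x 3 grid of work-groups
assert groups[0] * groups[1] == 9  # 9 work-groups of 4 work-items each
```

Note that the example launches a 6 x 6 global range over 5 x 5 matrices; the bounds checks inside the kernel are what keep the extra work-items from writing out of range.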