Kernel Programming Basics

Data Parallel Extensions for Python* introduce the concept of an offload kernel: a part of a Python program that is submitted for execution to a device queue.

[Figure: Offload Kernel]

There are multiple ways to write offload kernels. CUDA*, OpenCL*, and SYCL* offer a similar programming model known as data parallel kernel programming. In this model you express the work in terms of work items: you split the data into small pieces, and each piece becomes a unit of work, or a work item. The total number of work items is called the global size. You can also group work items into bigger chunks called work groups. The number of work items in a work group is called the local size.
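To make these terms concrete, the following plain-Python sketch enumerates a one-dimensional index space the way a device schedules it. Nothing is offloaded here, and the sizes are made up for illustration:

```python
global_size = 16   # total number of work items
local_size = 4     # work items per work group
num_groups = global_size // local_size  # 4 work groups

for group_id in range(num_groups):        # each work group
    for local_id in range(local_size):    # each work item within the group
        global_id = group_id * local_size + local_id
        # On a device this body would run as one work item, and all
        # work items would execute concurrently rather than in a loop.
        print(f"group {group_id}, local id {local_id} -> global id {global_id}")
```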

[Figure: Offload Kernel – work items grouped into work groups]

In this example there are 48 work items (8 in dimension 0 and 6 in dimension 1), so the global size is 48. The work items are grouped into work groups with a local size of 8 (4 in dimension 0 and 2 in dimension 1). In total there are 48/8 = 6 work groups.
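The same arithmetic in code, with the tuples mirroring the sizes from the figure:

```python
global_size = (8, 6)   # work items per dimension; 8 * 6 = 48 in total
local_size = (4, 2)    # work items per group per dimension; 4 * 2 = 8 per group

groups_per_dim = tuple(g // l for g, l in zip(global_size, local_size))  # (2, 3)
num_groups = (global_size[0] * global_size[1]) // (local_size[0] * local_size[1])

print(groups_per_dim)  # (2, 3)
print(num_groups)      # 6
```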

In the data parallel kernel programming model you write a function that processes a single work item. Such a function is called a data parallel kernel.

Data Parallel Extension for Numba offers a way to write data parallel kernels directly in Python using numba_dpex.kernel. It bears similarities to numba.cuda and numba.roc, but unlike those vendor-specific programming models, numba_dpex is built on top of SYCL*, which is hardware agnostic. This means that with the numba_dpex.kernel programming model you can write portable code that targets hardware from different vendors.
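As an illustration, here is a minimal vector addition written with numba_dpex.kernel. Treat it as a sketch rather than reference documentation: the launch syntax has changed between numba-dpex releases, and this version assumes a release that provides dpex.get_global_id and the Range launch parameter and accepts dpnp arrays (the kernel then executes on the device where the arrays were allocated):

```python
import dpnp
import numba_dpex as dpex

# A data parallel kernel: the body describes the work of ONE work item.
@dpex.kernel
def vector_add(a, b, c):
    i = dpex.get_global_id(0)   # global index of the current work item
    c[i] = a[i] + b[i]

# dpnp arrays live on a SYCL device; the kernel runs where the data is.
a = dpnp.arange(1024, dtype=dpnp.float32)
b = dpnp.ones(1024, dtype=dpnp.float32)
c = dpnp.zeros(1024, dtype=dpnp.float32)

# Launch one work item per element: the global size is 1024.
vector_add[dpex.Range(1024)](a, b, c)
print(c[:4])   # expected: [1. 2. 3. 4.]
```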

Note

The current version of numba-dpex supports only Intel SYCL devices.