Code-generation based on a device
In numba-dpex, kernels are written in a device-agnostic fashion, which makes it
easy to write portable code. A kernel is compiled for the device on which it is
enqueued for execution. The device is specified using a
dpctl.device_context context manager. In the following example, two versions
of the sum kernel are compiled, one for a GPU and another for a CPU, depending
on the context in which the function is invoked. Numba-dpex supports
OpenCL CPU and GPU devices and Level Zero GPU devices. In the future, compilation
support may be extended to other types of SYCL devices that are supported by
DPC++’s runtime.
    import numpy as np
    import numba_dpex as dpex
    import dpctl

    @dpex.kernel
    def sum(a, b, c):
        i = dpex.get_global_id(0)
        c[i] = a[i] + b[i]

    a = np.array(np.random.random(20), dtype=np.float32)
    b = np.array(np.random.random(20), dtype=np.float32)
    c = np.ones_like(a)

    with dpctl.device_context("level_zero:gpu"):
        sum[20, dpex.DEFAULT_LOCAL_SIZE](a, b, c)

    with dpctl.device_context("opencl:cpu"):
        sum[20, dpex.DEFAULT_LOCAL_SIZE](a, b, c)
Automatic offload of NumPy expressions
A key distinction between numba-dpex and the other GPU backends in Numba is
the ability to automatically offload specific data-parallel sections of a
Numba jit function.
Todo
Details and examples to be added.
Controllable Fallback
By default, if a section of code cannot be offloaded to the GPU, it is
automatically executed on the CPU and a warning is printed. This behavior
applies only to jit functions, auto-offloading of NumPy calls, array
expressions and prange loops. To disable the fallback and force the code to
run on the GPU, set the environment variable NUMBA_DPEX_FALLBACK_ON_CPU to
false (e.g. export NUMBA_DPEX_FALLBACK_ON_CPU=0). In this case the code does
not fall back to the CPU, and any offload errors are reported instead.
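The same setting can be applied from inside a Python script rather than the shell. The following is a minimal sketch, not part of numba-dpex itself: it assumes the variable is read when numba_dpex is first imported, so it sets the variable before any such import. The helper fallback_enabled is hypothetical and only reports how this script interprets the value; it is not how the library parses it.

```python
import os

# Assumption: numba-dpex reads this variable when it is first imported,
# so it must be set before `import numba_dpex` runs in this process.
os.environ["NUMBA_DPEX_FALLBACK_ON_CPU"] = "0"


def fallback_enabled(environ=os.environ):
    """Hypothetical helper: report how this script reads the setting.

    An unset variable is treated as the documented default (fallback on).
    """
    return environ.get("NUMBA_DPEX_FALLBACK_ON_CPU", "1") != "0"


print("CPU fallback enabled:", fallback_enabled())
```

Setting the variable in the script keeps the configuration next to the code that depends on it, at the cost of having to place it ahead of the imports.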
Offload Diagnostics
Setting the debug environment variable NUMBA_DPEX_OFFLOAD_DIAGNOSTICS (e.g.
export NUMBA_DPEX_OFFLOAD_DIAGNOSTICS=1) enables the emission of parallel
and offload diagnostics based on the parallel transforms that were produced. The
level of detail depends on the integer value, between 1 and 4, assigned to the
environment variable (higher is more detailed). The “Auto-offloading” section
of the diagnostics reports the device (by device name) to which each parfor or
kernel was offloaded.
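As a small illustration of the 1 to 4 range, the sketch below validates a diagnostics level before exporting it from Python. The helper set_offload_diagnostics is hypothetical and not a numba-dpex function. As an assumption, the variable is set before numba_dpex is imported so that the library can pick it up.

```python
import os


def set_offload_diagnostics(level, environ=os.environ):
    """Hypothetical helper: validate the level (1 to 4) and export it."""
    if not isinstance(level, int) or not 1 <= level <= 4:
        raise ValueError("diagnostics level must be an integer from 1 to 4")
    environ["NUMBA_DPEX_OFFLOAD_DIAGNOSTICS"] = str(level)
    return environ["NUMBA_DPEX_OFFLOAD_DIAGNOSTICS"]


# Assumption: call this before `import numba_dpex` in the same process.
set_offload_diagnostics(4)  # request the most detailed output
```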