Code-generation based on a device

Kernels are written in a device-agnostic fashion, which makes it easy to write portable code. A kernel is compiled for the device on which it is enqueued for execution. The device is specified with a dpctl.device_context context manager. In the following example, two versions of the sum kernel are compiled, one for a GPU and another for a CPU, depending on the context in which the function is invoked. Currently, OpenCL CPU and GPU devices and Level Zero GPU devices are supported. In the future, compilation support may be extended to other types of SYCL devices supported by the DPC++ runtime.

import numpy as np
import numba_dpex as dppy
import dpctl

@dppy.kernel
def sum(a, b, c):
    # Each work item adds the pair of elements at its global index.
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

a = np.array(np.random.random(20), dtype=np.float32)
b = np.array(np.random.random(20), dtype=np.float32)
c = np.ones_like(a)

with dpctl.device_context("level_zero:gpu"):
    sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

with dpctl.device_context("opencl:cpu"):
    sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
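
As a rough mental model of the launch sum[20, dppy.DEFAULT_LOCAL_SIZE](a, b, c), the sketch below emulates a one-dimensional global range sequentially in plain Python. It is not how numba-dpex executes kernels, because real work items run in parallel on the device. The helper names sum_work_item and launch_1d are hypothetical and exist only for illustration.

```python
import random


def sum_work_item(i, a, b, c):
    # Body of the kernel for the work item whose global id is i.
    c[i] = a[i] + b[i]


def launch_1d(global_size, work_item, *args):
    # Hypothetical stand-in for a 1-D launch: call the body once per global id.
    # A real device runs these work items in parallel, in no guaranteed order.
    for i in range(global_size):
        work_item(i, *args)


n = 20
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]
c = [1.0] * n

launch_1d(n, sum_work_item, a, b, c)
```

Because each work item writes only to its own index, the result does not depend on the order in which the work items run, which is why the same kernel body can be compiled for either device.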

Automatic offload of NumPy expressions

A key distinction between numba-dpex and the other GPU backends in Numba is its ability to automatically offload specific data-parallel sections of a Numba jit function.

Todo

Details and examples to be added.

Controllable Fallback

By default, if a section of code cannot be offloaded to the GPU, it is automatically executed on the CPU and a warning is printed. This behavior applies only to jit functions and to the auto-offloading of NumPy calls, array expressions, and prange loops. To disable the fallback and force the code to run on the GPU, set the environment variable NUMBA_DPEX_FALLBACK_ON_CPU to false (e.g. export NUMBA_DPEX_FALLBACK_ON_CPU=0). In that case the code does not fall back to the CPU, and any offload failure is reported as an error.
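
The variable can also be set from Python instead of the shell. The minimal sketch below assumes that numba-dpex reads its configuration when the package is first imported, so the assignment has to happen before that import. The import itself is left as a comment so that the snippet runs without a device.

```python
import os

# Assumption: numba-dpex reads this variable at import time, so set it first.
os.environ["NUMBA_DPEX_FALLBACK_ON_CPU"] = "0"  # "0" disables the CPU fallback

# import numba_dpex  # import only after the environment is configured

print(os.environ["NUMBA_DPEX_FALLBACK_ON_CPU"])  # prints 0
```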

Offload Diagnostics

Setting the debug environment variable NUMBA_DPEX_OFFLOAD_DIAGNOSTICS (e.g. export NUMBA_DPEX_OFFLOAD_DIAGNOSTICS=1) makes numba-dpex emit parallel and offload diagnostics based on the parallel transforms it produced. The level of detail depends on the integer value, between 1 and 4, assigned to the environment variable (higher is more detailed). The “Auto-offloading” section of the diagnostics reports the device (by name) to which each parfor or kernel was offloaded.
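
When diagnostics are needed for a single run only, one option is to pass the variable to a child process rather than export it for the whole shell session. The sketch below is illustrative: diagnostics_env is a hypothetical helper, not part of numba-dpex, and my_script.py is a placeholder name. The helper only validates the documented range of 1 to 4 and builds the environment.

```python
import os


def diagnostics_env(level, base_env=None):
    """Return a copy of the environment with offload diagnostics enabled."""
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 4:
        raise ValueError("diagnostics level must be an integer from 1 to 4")
    env = dict(os.environ if base_env is None else base_env)
    env["NUMBA_DPEX_OFFLOAD_DIAGNOSTICS"] = str(level)
    return env


env = diagnostics_env(4)  # 4 requests the most detailed output

# Pass env to a child process, for example:
# subprocess.run([sys.executable, "my_script.py"], env=env)
```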