oneAPI Python extensions

Python extensions can be built with DPC++. This document explains how.

Suitability of DPC++ for Python stack

DPC++ is a single source compiler. It generates both the host code and the device code in a single fat binary.

DPC++ is an LLVM-based compiler, but the host portion of the binary it produces is compatible with GCC runtime libraries on Linux and with Windows runtime libraries on Windows. Thus, native Python extensions authored in C++ can be built directly with DPC++. Such extensions require the DPC++ runtime library at run time.

The Intel(R) compute runtime needs to be present for the DPC++ runtime to target supported Intel devices. When using open-source DPC++ from github.com/intel/llvm compiled with support for NVIDIA CUDA, HIP NVIDIA, or HIP AMD (see intel/llvm getting-started for details), the respective runtimes and drivers must be present for the DPC++ runtime to target those devices.

Build a data-parallel Python native extension

There are two supported ways of building a data-parallel extension: using Cython and using pybind11. The companion repository IntelPython/sample-data-parallel-extensions provides examples demonstrating both approaches by implementing two prototype native extensions that evaluate a kernel density estimate at a set of points, from a Python function with the following signature:

def kde_eval(exec_q: dpctl.SyclQueue, x: np.ndarray, data: np.ndarray, h: float) -> np.ndarray: ...
    """
    Args:
       exec_q: execution queue specifying the offload target
       x: NumPy array of shape (n, dim)
       data: NumPy array of shape (n_data, dim)
       h: smoothing parameter
    """

The examples can be cloned locally using git:

git clone https://github.com/IntelPython/sample-data-parallel-extensions.git

The examples demonstrate a key benefit of using the dpctl package and its included Cython and pybind11 bindings for oneAPI: the extension author can focus on writing a data-parallel kernel in DPC++ while leaving generation of the necessary Python bindings to dpctl.

Building packages with setuptools

When using setuptools, set the environment variables CC and LDSHARED, which setuptools recognizes, to ensure that dpcpp is used to compile and link the extensions:

CC=dpcpp LDSHARED="dpcpp --shared" python setup.py develop
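A minimal setup.py for such a build might look like the following sketch. This assumes a Cython-based extension; the names kde_setuptools and kde.pyx are placeholders for illustration, not taken from the sample repository:

```python
# setup.py -- minimal sketch, not the one from the sample repository.
# Build with the DPC++ compiler selected via environment variables:
#   CC=dpcpp LDSHARED="dpcpp --shared" python setup.py develop
from setuptools import Extension, setup
from Cython.Build import cythonize

# "kde" and "kde.pyx" are placeholder names for illustration
extensions = [Extension("kde", sources=["kde.pyx"], language="c++")]

setup(
    name="kde_setuptools",
    ext_modules=cythonize(extensions),
)
```

Note that no compiler is hard-coded in the script itself; the CC and LDSHARED environment variables alone redirect setuptools to dpcpp.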

The resulting extension is a fat binary containing both the host code, with Python bindings and offload orchestration, and the device code, usually stored in a cross-platform intermediate representation (SPIR-V) and compiled at run time for the device indicated via the execution-queue argument using tooling from the compute runtime.

Building packages with scikit-build

Using setuptools is convenient, but may feel clunky. Scikit-build offers an alternative for users who prefer, or are already familiar with, CMake.

Scikit-build enables writing the logic of Python package building in CMake, which supports oneAPI DPC++. It supports building both Cython-generated and pybind11-generated native extensions. dpctl integrates with CMake, so these extension generators can be used conveniently simply by including

find_package(Dpctl REQUIRED)

In order for CMake to locate the script that makes the example work, the example CMakeLists.txt in the kde_skbuild package introduces a DPCTL_MODULE_PATH variable, which should be set to the output of python -m dpctl --cmakedir. DPC++ integration with CMake requires that CMake's C and/or C++ compilers be set to the Intel LLVM compilers provided in the oneAPI Base Toolkit.

python setup.py develop -G Ninja -- \
    -DCMAKE_C_COMPILER=icx          \
    -DCMAKE_CXX_COMPILER=icpx       \
    -DDPCTL_MODULE_PATH=$(python -m dpctl --cmakedir)

Alternatively, we can rely on CMake recognizing the CC and CXX environment variables to shorten the input:

CC=icx CXX=icpx python setup.py develop -G Ninja -- -DDPCTL_MODULE_PATH=$(python -m dpctl --cmakedir)
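For reference, the CMake side that these commands assume might be sketched as follows. This is an illustrative fragment under our assumptions about the project layout, not the actual CMakeLists.txt from the sample repository:

```cmake
cmake_minimum_required(VERSION 3.21)
project(kde_skbuild LANGUAGES CXX)

# DPCTL_MODULE_PATH is expected to hold the output of
#   python -m dpctl --cmakedir
# so that find_package can locate dpctl's CMake configuration.
set(DPCTL_MODULE_PATH "" CACHE PATH "Output of python -m dpctl --cmakedir")
list(APPEND CMAKE_MODULE_PATH ${DPCTL_MODULE_PATH})

find_package(Dpctl REQUIRED)
```

The cache variable keeps the dpctl location out of the CMake script itself, so the same CMakeLists.txt works across environments where dpctl is installed in different prefixes.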

Whichever way of building the data-parallel extension appeals to you, the end result allows offloading computations specified as DPC++ kernels to any supported device:

import dpctl
import numpy as np
import kde_skbuild as kde

cpu_q = dpctl.SyclQueue("cpu")
gpu_q = dpctl.SyclQueue("gpu")

# output info about targeted devices
cpu_q.print_device_info()
gpu_q.print_device_info()

x = np.linspace(0.1, 0.9, num=14000)
data = np.random.uniform(0, 1, size=10**6)

# Notice that the first evaluation JIT-compiles the kernel;
# subsequent evaluations reuse the cached binary
f0 = kde.cython_kde_eval(cpu_q, x[:, np.newaxis], data[:, np.newaxis], 3e-6)

f1 = kde.cython_kde_eval(gpu_q, x[:, np.newaxis], data[:, np.newaxis], 3e-6)

assert np.allclose(f0, f1)

The following naive NumPy implementation can be used to validate the results produced by the sample extensions. Do note that this validation script cannot handle very large inputs and will raise a MemoryError exception, since it materializes an (n, n_data) intermediate array.

def ref_kde(x, data, h):
    """
    Reference NumPy implementation for KDE evaluation
    """
    assert x.ndim == 2 and data.ndim == 2
    assert x.shape[1] == data.shape[1]
    dim = x.shape[1]
    n_data = data.shape[0]
    return np.exp(
        np.square(x[:, np.newaxis, :]-data).sum(axis=-1)/(-2*h*h)
    ).sum(axis=1)/(np.sqrt(2*np.pi)*h)**dim / n_data
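To sidestep the MemoryError for large inputs, the same reference computation can be evaluated block-by-block over the rows of x. A possible chunked variant (the name blocked_ref_kde and the block_size parameter are ours, not from the sample repository):

```python
import numpy as np

def blocked_ref_kde(x, data, h, block_size=4096):
    """Evaluate the reference KDE in blocks of rows of x, bounding
    the intermediate array at (block_size, n_data) elements."""
    assert x.ndim == 2 and data.ndim == 2
    assert x.shape[1] == data.shape[1]
    dim = x.shape[1]
    n_data = data.shape[0]
    norm = (np.sqrt(2 * np.pi) * h) ** dim * n_data
    out = np.empty(x.shape[0], dtype=np.float64)
    for start in range(0, x.shape[0], block_size):
        xb = x[start : start + block_size]
        sq = np.square(xb[:, np.newaxis, :] - data).sum(axis=-1)
        out[start : start + block_size] = np.exp(sq / (-2 * h * h)).sum(axis=1) / norm
    return out
```

The results match ref_kde to floating-point tolerance, but peak memory is proportional to block_size * n_data instead of n * n_data.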

Using the CPU offload target allows CPU computations to be parallelized. For example, try:

data = np.random.uniform(0, 1, size=10**3)
x = np.linspace(0.1, 0.9, num=140)
h = 3e-3

%time fr = ref_kde(x[:,np.newaxis], data[:, np.newaxis], h)
%time f0 = kde.cython_kde_eval(cpu_q, x[:, np.newaxis], data[:, np.newaxis], h)
%time f1 = kde.cython_kde_eval(gpu_q, x[:, np.newaxis], data[:, np.newaxis], h)

assert np.allclose(f0, fr) and np.allclose(f1, fr)

dpctl can also be used to build data-parallel Python extensions whose functions operate on USM-based arrays. For an example, please refer to examples/pybind11/onemkl_gemv in the dpctl sources.