KDE Python extension

KDE (kernel density estimation) Python extension example.

Since SYCL builds on C++, we are going to use the pybind11 project to generate a Python extension. We also need Python objects to carry USM allocations of input and output data; these are provided by dpctl (the Data Parallel Control Python package). The dpctl package also provides Python objects corresponding to DPC++ runtime objects:

Python object        SYCL C++ object
-----------------    ---------------
dpctl.SyclQueue      sycl::queue
dpctl.SyclDevice     sycl::device
dpctl.SyclContext    sycl::context
dpctl.SyclEvent      sycl::event

dpctl integrates with pybind11, supporting casting between the dpctl Python objects and the corresponding SYCL C++ classes listed in the table above. Furthermore, the integration provides the C++ class dpctl::tensor::usm_ndarray, which derives from pybind11::object. It stores the dpctl.tensor.usm_ndarray object and provides methods to query its attributes, such as the data pointer, dimensionality, shape, strides, and elemental type information. The memory underlying dpctl.tensor.usm_ndarray is a SYCL unified shared memory (USM) allocation. See the SYCL standard or the dpctl.memory documentation for more details.
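
For instance, a bound function can accept sycl::queue or dpctl::tensor::usm_ndarray arguments directly and query them. Below is a minimal sketch, separate from the actual extension; the module name _example and its functions are illustrative, and the attribute accessors used (get_ndim(), get_shape()) follow the dpctl4pybind11 header:

#include <pybind11/pybind11.h>
#include <sycl/sycl.hpp>
#include "dpctl4pybind11.hpp"
#include <cstddef>
#include <string>

// the dpctl casters convert a dpctl.SyclQueue argument to sycl::queue
std::string py_device_name(const sycl::queue &q) {
    return q.get_device().get_info<sycl::info::device::name>();
}

// query attributes of a dpctl.tensor.usm_ndarray argument
std::size_t py_num_elements(dpctl::tensor::usm_ndarray arr) {
    std::size_t nelems = 1;
    for (int i = 0; i < arr.get_ndim(); ++i) {
        nelems *= static_cast<std::size_t>(arr.get_shape(i));
    }
    return nelems;
}

PYBIND11_MODULE(_example, m) {
    m.def("device_name", &py_device_name);
    m.def("num_elements", &py_num_elements);
}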

For illustration purposes, here is a sample extension source code:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <sycl/sycl.hpp>
#include "dpctl4pybind11.hpp"
#include <vector>

sycl::event
py_foo(dpctl::tensor::usm_ndarray inp,
       dpctl::tensor::usm_ndarray out,
       const std::vector<sycl::event> &deps) {
    // validation steps skipped

    // The execution queue is the queue associated with the input array;
    // the queues of the input and output arrays are expected to be the
    // same, which is checked during validation
    sycl::queue exec_q = inp.get_queue();

    const std::int64_t *inp_ptr = inp.get_data<std::int64_t>();
    std::int64_t *out_ptr = out.get_data<std::int64_t>();

    // submit tasks for execution and obtain the event signaling
    // the status of execution of the set of tasks
    sycl::event e_impl = impl_fn(exec_q, inp_ptr, out_ptr, deps);

    return e_impl;
}

PYBIND11_MODULE(_ext, m) {
    m.def("foo", &py_foo);
}
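
The sample above calls an implementation function impl_fn that is not shown. A minimal sketch of what such a function could look like is given below; the body (doubling each element) is purely illustrative, and the extra n parameter carrying the number of elements is an assumption, since a real implementation would obtain the problem size during the validation steps the sample skips:

// hypothetical implementation function: doubles each input element
sycl::event impl_fn(sycl::queue &exec_q,
                    const std::int64_t *inp,
                    std::int64_t *out,
                    std::size_t n,
                    const std::vector<sycl::event> &deps)
{
    return exec_q.submit([&](sycl::handler &cgh) {
        // order this kernel after the dependency events
        cgh.depends_on(deps);
        cgh.parallel_for(sycl::range<1>(n), [=](sycl::id<1> id) {
            out[id] = 2 * inp[id];
        });
    });
}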

On the Python side, the function would be called as follows:

import dpctl.tensor as dpt
import _ext

# Allocate input and output arrays on the
# default-selected device
inp = dpt.arange(100, dtype=dpt.int64)
out = dpt.empty_like(inp)
ev = _ext.foo(inp, out, [])

# ...

Since execution is offloaded to a device, it is our responsibility to ensure that the USM data being worked on is not deallocated until the offloaded tasks complete execution. The simplest way to ensure this is to wait on the event, with ev.wait() on the Python side or with e_impl.wait() on the C++ side.

Alternatively, one can extend the lifetime of the arrays asynchronously by using sycl::handler::host_task to schedule code execution on the host, ordered after the kernel execution completes:

// increment reference counts of input and output arrays; this runs
// on the main thread, where the GIL is already held.
// pybind11::handle does not manage reference counts automatically,
// so the copies captured by the host task are plain pointers
pybind11::handle inp_h = inp;
pybind11::handle out_h = out;
inp_h.inc_ref();
out_h.inc_ref();

sycl::event ht_ev =
    exec_q.submit([&](sycl::handler &cgh) {
        // execute host_task once e_impl signals completion
        cgh.depends_on(e_impl);

        cgh.host_task([=]() {
            // we must acquire GIL to be able to safely
            // manipulate reference counts of Python objects
            pybind11::gil_scoped_acquire guard;

            // decrement ref-counts
            inp_h.dec_ref();
            out_h.dec_ref();
        });
    });
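
The dpctl4pybind11.hpp header also provides a helper implementing this pattern, dpctl::utils::keep_args_alive, which schedules such a host task for a set of Python objects and returns the sycl::event associated with it.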

Since the host task may execute on a thread different from that of the Python interpreter (the main thread), care must be taken to avoid deadlocks: a synchronization operation that calls ht_ev.wait() from the main thread must release the GIL to give the body of the host task a chance to execute.
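
In code, such a blocking wait would look as follows (a minimal sketch, reusing ht_ev from the snippet above):

{
    // release the GIL for the duration of the blocking wait, so that
    // the host task can acquire it and run to completion
    pybind11::gil_scoped_release release;
    ht_ev.wait();
}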

Of course, if the USM memory is not managed by Python, it may be possible to avoid using the GIL altogether.
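
For example, a temporary scratch allocation made directly with USM can be freed in a host task without touching Python state at all. In the following sketch, tmp, n_tmp, and the dependency event e_impl are assumptions standing in for an extension's intermediate data:

// scratch memory allocated directly from USM, not owned by Python
double *tmp = sycl::malloc_device<double>(n_tmp, exec_q);

// ... kernels reading/writing tmp, producing the event e_impl ...

sycl::event cleanup_ev =
    exec_q.submit([&](sycl::handler &cgh) {
        cgh.depends_on(e_impl);
        // no Python objects are involved, so no GIL is needed here
        cgh.host_task([=]() { sycl::free(tmp, exec_q); });
    });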

An example of a Python extension, "kde_sycl_ext", that exposes the kernel density estimation code from the previous section can be found in the "steps/sycl_python_extension" folder (see its README).

The folder contains a comparison between the dpctl-based implementation of KDE, which follows the NumPy implementation above, and the dedicated C++ code:

KDE for n_sample = 1000000, n_est = 17, n_dim = 7, h = 0.05
Result agreed.
kde_dpctl took 0.3404452269896865 seconds
kde_ext[mode=0] 0.02209925901843235 seconds
kde_ext[mode=1] 0.02560457994695753 seconds
kde_ext[mode=2] 0.02815118699800223 seconds
kde_numpy 0.7227164240321144 seconds

This sample run was obtained on a laptop with an 11th Gen Intel(R) Core(TM) i7-1185G7 CPU @ 3.00GHz, 32 GB of RAM, and an integrated Intel(R) Iris(R) Xe GPU, with stock NumPy 1.26.4 and a development build of dpctl 0.17 built with oneAPI DPC++ 2024.1.0.