Programming model

Get started with oneAPI Python extensions

In a heterogeneous system there may be multiple devices that a Python user wants to engage. For example, it is common for a consumer-grade laptop to feature an integrated or a discrete GPU alongside a CPU.

To harness their power, one needs to know how to answer three key questions:

  1. How does a Python program recognize available computational devices?
  2. How does a Python workload specify computations to be offloaded to selected devices?
  3. How does a Python application manage data sharing?

Recognizing available devices

The Python package dpctl answers these questions. All the computational devices known to the underlying DPC++ runtime can be listed using dpctl.get_devices(). A specific device of interest can be selected either by using a helper function, e.g. dpctl.select_gpu_device(), or by passing a filter selector string to the dpctl.SyclDevice constructor.

import dpctl

# select a GPU device; if multiple GPUs are present,
# the underlying runtime chooses among them
dev_gpu = dpctl.SyclDevice("gpu")
# select a CPU device
dev_cpu = dpctl.SyclDevice("cpu")

# stand-alone function, equivalent to C++ 
#   `auto dev = sycl::gpu_selector().select_device();`
dev_gpu_alt = dpctl.select_gpu_device()
# stand-alone function, equivalent to C++ 
#   `auto dev = sycl::cpu_selector().select_device();`
dev_cpu_alt = dpctl.select_cpu_device()

A device object can be used to query properties of the device, such as its name, vendor, maximum number of compute units, memory size, etc.
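
For instance, one might enumerate the available devices and print a few of their properties. The sketch below assumes that the dpctl.SyclDevice attribute names (name, vendor, max_compute_units) match the installed dpctl version:

import dpctl

# enumerate every device known to the underlying DPC++ runtime
for dev in dpctl.get_devices():
    # name, vendor and max_compute_units are dpctl.SyclDevice properties
    print(dev.name, dev.vendor, dev.max_compute_units)

# query a specific device selected by a filter selector string
dev_gpu = dpctl.SyclDevice("gpu")
print(dev_gpu.name)               # e.g. the name reported by the driver
print(dev_gpu.max_compute_units)  # maximum number of compute units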

Specifying offload target

To answer the second question on the list, we first need a short digression to explain how offloading works in oneAPI DPC++.

A computational task is offloaded for execution on a device by submitting it to the DPC++ runtime, which inserts the task into a computational graph. Once the device becomes available, the runtime selects for execution a task whose dependencies are met. The computational graph, as well as the device targeted by its tasks, is stored in a SYCL queue object. Task submission is therefore always associated with a queue.

Queues can be constructed directly from a device object, or by using a filter selector string to indicate the device for which the queue is to be constructed:

# construct queue from device object
q1 = dpctl.SyclQueue(dev_gpu)
# construct queue using filter selector
q2 = dpctl.SyclQueue("gpu")
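
Since a queue stores the device its tasks target, that device can be recovered from the queue itself. A small sketch, assuming the sycl_device property of dpctl.SyclQueue:

# the device targeted by a queue is stored in the queue;
# sycl_device is assumed to be a property of dpctl.SyclQueue
print(q1.sycl_device.name)  # GPU device the queue was constructed from
print(q2.sycl_device.name)  # GPU device chosen by the filter selector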

The computational tasks can be stored in a oneAPI native extension, in which case their submission is orchestrated during Python API calls. Let’s consider a function that offloads the evaluation of a polynomial for every element of a NumPy array X. Such a function needs to receive a queue object to indicate on which device the computation must be executed:

import numpy as np

# allocate space for the result
Y = np.empty_like(X)
# evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i]);
# exec_q is a dpctl.SyclQueue targeting the offload device, e.g. q1 above
oneapi_ext.offloaded_poly_evaluate(exec_q, X, Y)

A Python call to oneapi_ext.offloaded_poly_evaluate applied to NumPy arrays of double-precision floating-point numbers translates to roughly the following C++ code:

void
cpp_offloaded_poly_evaluate(
  sycl::queue q, const double *X, double *Y, size_t n) {
    // create buffers from the host allocations to make data accessible from the device
    sycl::buffer<double, 1> buf_X(X, sycl::range<1>(n));
    sycl::buffer<double, 1> buf_Y(Y, sycl::range<1>(n));

    q.submit([&](sycl::handler &cgh) {
        // create buffer accessors indicating the kernel's data-flow pattern
        sycl::accessor acc_X(buf_X, cgh, sycl::read_only);
        sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init);

        cgh.parallel_for(sycl::range<1>(n),
           // lambda function executed by different work-items with
           // different arguments in parallel
           [=](sycl::id<1> id) {
              auto x = acc_X[id];
              acc_Y[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x));
           });
    }).wait();

    return;
}

We refer the interested reader to the excellent, freely available “Data Parallel C++” book for the details of programming in DPC++.

Our package numba_dpex allows one to write such kernels directly in Python:

import numba_dpex

@numba_dpex.kernel
def numba_dpex_poly(X, Y):
    i = numba_dpex.get_global_id(0)
    x = X[i]
    Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x))

Specifying the execution queue is done using a Python context manager:

import numpy as np

X = np.random.randn(10**6)
Y = np.empty_like(X)

with dpctl.device_context(q1):
    # apply the kernel to elements of X, writing values into Y,
    # while executing using the given queue
    numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

The argument to device_context can be a queue object, a device object for which a temporary queue will be created, or a filter selector string. Thus we could have equally used dpctl.device_context(dev_gpu) or dpctl.device_context("gpu").
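
To make the equivalence concrete, here is a small sketch of the three forms, reusing the queue and device objects constructed earlier:

# an existing queue object
with dpctl.device_context(q1):
    numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

# a device object; a temporary queue is created for it
with dpctl.device_context(dev_gpu):
    numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

# a filter selector string
with dpctl.device_context("gpu"):
    numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)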

Note that in these examples data sharing was implicitly managed for us: in the case of calling a function from a precompiled oneAPI native extension, data sharing was managed by the DPC++ runtime, while in the case of the numba_dpex kernel it was managed during the execution of the kernel's __call__ method.

Data sharing

Implicit management of data is surely convenient, but its use in interpreted code comes at a performance cost. For every Python API call, the runtime must copy data from the host to the device before kernel execution commences, and then copy some (or all) of it back after execution completes.

dpctl provides for allocating memory directly accessible to kernels executing on a device, using SYCL’s Unified Shared Memory (USM) feature. It also implements a USM-based ND-array object, dpctl.tensor.usm_ndarray, that conforms to the array API standard.

import dpctl.tensor as dpt

# allocate an array of doubles using USM-device allocation on the GPU device
X = dpt.arange(0.0, stop=1.0, step=1e-6, device="gpu", usm_type="device")
# allocate array for the output
Y = dpt.empty_like(X)

# execution queue is inferred from allocation queues.
# Kernel is executed on the same device where arrays were allocated
numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y)

The execution queue can be unambiguously determined in this case, since both arguments are USM arrays with the same allocation queue and X.sycl_queue == Y.sycl_queue evaluates to True. Should the allocation queues differ, the inference becomes ambiguous and numba_dpex raises IndeterminateExecutionQueueError, advising the user to explicitly migrate the data.

Migration can be accomplished either by using dpctl.tensor.asarray(X, device=target_device) to create a copy, or by using the X.to_device(target_device) method.

A USM array can be copied back into a NumPy array using dpt.asnumpy(Y) if needed.
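
Putting these together, a minimal sketch of both migration routes and the copy back to the host (the CPU device here is only an illustrative migration target):

# an illustrative migration target
target_device = dpctl.SyclDevice("cpu")

# migrate by creating a copy on the target device ...
X_cpu = dpt.asarray(X, device=target_device)
# ... or by using the to_device method of the array
X_cpu = X.to_device(target_device)

# copy a USM array back into a host NumPy array
Y_host = dpt.asnumpy(Y)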

Compute follows data

Automatic deduction of the execution queue from allocation queues is consistent with the “local control for data allocation target” principle in the array API standard. The user has full control over memory allocation through three keyword arguments present in all array creation functions. For example, consider:

# Use usm_type="device" to get a USM-device allocation (default),
#     usm_type="shared" to get a USM-shared allocation,
#     usm_type="host"   to get a USM-host allocation
dpctl.tensor.empty(..., device=None, usm_type=None, sycl_queue=None) -> dpctl.tensor.usm_ndarray

The keyword device is mandated by the array API. In dpctl.tensor the allowed values of the keyword are

  • Filter selector string, e.g. device="gpu:0"
  • Existing dpctl.SyclDevice object, e.g. device=dev_gpu
  • Existing dpctl.SyclQueue object
  • dpctl.tensor.Device object instance obtained from an existing USM array, e.g. device=X.device

In all cases, an allocation queue object will be constructed as described earlier and stored in the array instance, accessible with X.sycl_queue. Instead of the device keyword, one can alternatively use the sycl_queue keyword to directly specify a dpctl.SyclQueue object to be used as the allocation queue.
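
As a sketch, assuming the device dev_gpu and the queue q1 constructed earlier are still available, the allocation target can be specified in any of the following ways:

import dpctl.tensor as dpt

A = dpt.empty(1024, device="gpu:0")    # filter selector string
B = dpt.empty(1024, device=dev_gpu)    # dpctl.SyclDevice object
C = dpt.empty(1024, device=q1)         # dpctl.SyclQueue object
D = dpt.empty(1024, device=A.device)   # Device object of an existing USM array
E = dpt.empty(1024, sycl_queue=q1)     # allocation queue specified directly

# the allocation queue is stored in the array instance
print(A.sycl_queue)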

The rationale for storing the allocation queue in the array is that kernels submitted to this queue are guaranteed to be able to correctly dereference (i.e. access) the USM pointer. Array operations that involve only this single USM array can thus execute on the allocation queue, and the output array can be allocated on the same allocation queue with the same USM type as the input array.

Compute follows data is the rule prescribing how the execution queue, the allocation queue, and the USM type of the result are deduced when multiple USM arrays are combined. It stipulates that arrays can be combined if and only if their allocation queues are the same as measured by the == operator (i.e. X.sycl_queue == Y.sycl_queue must evaluate to True). Equal queues refer to the same underlying task graph and DPC++ scheduler.

An attempt to combine USM arrays with different allocation queues raises an exception advising the user to migrate the data. Migration can be accomplished either by using dpctl.tensor.asarray(X, device=Y.device) to create a copy, or by using the X.to_device(Y.device) method, which can sometimes perform the migration more efficiently.
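
A short sketch of the rule in action, allocating a second array on the CPU so that the queues deliberately differ (availability of a CPU device is assumed):

W = dpt.empty_like(X)                       # inherits the allocation queue of X
V = dpt.asarray([1.0, 2.0], device="cpu")   # allocated on a different queue

print(X.sycl_queue == W.sycl_queue)   # True: X and W may be combined
print(X.sycl_queue == V.sycl_queue)   # False: combining X and V raises an exception

# migrate V to the device of X before combining the arrays
V_migrated = V.to_device(X.device)
print(X.sycl_queue == V_migrated.sycl_queue)  # True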