oneAPI programming model¶
oneAPI library and its Python interface¶
Using oneAPI libraries, a user calls functions that take a sycl::queue and a collection of sycl::event objects, among other arguments. For example:
sycl::event
compute(
    sycl::queue &exec_q,
    ...,
    const std::vector<sycl::event> &dependent_events
);
The function compute inserts computational tasks into the queue exec_q for the DPC++ runtime to execute on the device the queue targets. Execution may begin only after all tasks whose execution status is represented by the sycl::event objects in the provided dependent_events vector have completed. If the vector is empty, the runtime begins execution as soon as the device is ready. The function returns a sycl::event object representing the completion of the set of computational tasks submitted by compute.
Hence, in the oneAPI programming model, the execution queue specifies which device a function will execute on. To create a queue, one must specify the device it targets.
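This queue-device relationship can be modeled in plain Python. The Device, Queue, and compute names below are illustrative stand-ins, not actual SYCL or dpctl types:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Device:
    """Stand-in for sycl::device."""
    name: str


@dataclass(frozen=True)
class Queue:
    """Stand-in for sycl::queue: every queue targets exactly one device."""
    device: Device


def compute(exec_q: Queue, dependent_events: Sequence = ()):
    # tasks submitted here would run on the device exec_q targets,
    # only after every event in dependent_events completes
    return ("event", exec_q.device)


# creating a queue requires choosing the device it targets
q = Queue(Device("gpu"))
ev = compute(q)
```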
In dpctl, sycl::queue is represented by the dpctl.SyclQueue Python type, and a Python API to call such a function might look like:
def call_compute(
    exec_q: dpctl.SyclQueue,
    ...,
    dependent_events: List[dpctl.SyclEvent] = []
) -> dpctl.SyclEvent:
    ...
When building a Python API for a SYCL offloading function, you may choose to map the SYCL API to a different API on the Python side, but it must still translate to a similar call under the hood. The arguments to the function must be suitable for use in the offloading functions.
Typically these are Python scalars or objects representing USM allocations, such as dpctl.tensor.usm_ndarray, dpctl.memory.MemoryUSMDevice, and friends.
Note
The USM allocations these objects represent must not get deallocated before offloaded tasks that access them complete.
This is something authors of DPC++-based Python extensions must take care of, and something users of such extensions may assume is assured.
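One common way for an extension to meet this lifetime requirement is to keep Python references to the allocations alive until the returned event completes. A minimal sketch of that pattern in plain Python (Event and submit are hypothetical stand-ins, not dpctl APIs):

```python
class Event:
    """Stand-in for an event that keeps its task's arguments alive."""

    def __init__(self, kept_alive):
        # holding references here prevents the allocations from being
        # garbage-collected while the offloaded task may still use them
        self._kept_alive = tuple(kept_alive)
        self.complete = False

    def wait(self):
        self.complete = True
        self._kept_alive = ()  # safe to drop the references now


def submit(*usm_args):
    # an extension would enqueue device work here; the returned
    # event retains the arguments until completion
    return Event(kept_alive=usm_args)


ev = submit("alloc1", "alloc2")
```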
USM allocations in dpctl and compute-follows-data¶
To make a USM allocation on a device in SYCL, one needs to specify the sycl::device in whose memory the allocation is made and the sycl::context to which the allocation is bound.
A sycl::queue object is often used instead. In such cases, the sycl::context and sycl::device associated with the queue are used to make the allocation.
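A sketch in plain Python of how a device, a context, and a queue relate to an allocation (the classes below are illustrative stand-ins, not dpctl types):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Stand-in for sycl::device."""
    name: str


@dataclass(frozen=True)
class Context:
    """Stand-in for sycl::context."""
    ident: int


@dataclass(frozen=True)
class Queue:
    """Stand-in for sycl::queue."""
    device: Device
    context: Context


class USMAllocation:
    """Stand-in for a USM allocation bound to a device and a context."""

    def __init__(self, nbytes: int, queue: Queue):
        # mirroring the sycl::queue convenience overloads, the device
        # and context are taken from the queue that is passed in
        self.nbytes = nbytes
        self.device = queue.device
        self.context = queue.context


q = Queue(Device("cpu"), Context(0))
mem = USMAllocation(4096, q)
```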
Important
dpctl chose to associate a queue object with every USM allocation. The associated queue may be queried using the .sycl_queue property of the Python type representing the USM allocation.
This design choice gives dpctl a preferred queue to use when operating on any single USM allocation. For example:
def unary_func(x: dpctl.tensor.usm_ndarray):
    # code1
    _ = _func_impl(x.sycl_queue, ...)
    # code2
When combining several objects representing USM allocations, the programming model adopted in dpctl insists that the queues associated with all the objects be the same, in which case that common queue is used as the execution queue. Otherwise, dpctl.utils.ExecutionPlacementError is raised.
def binary_func(
    x1: dpctl.tensor.usm_ndarray,
    x2: dpctl.tensor.usm_ndarray
):
    exec_q = dpctl.utils.get_execution_queue((x1.sycl_queue, x2.sycl_queue))
    if exec_q is None:
        raise dpctl.utils.ExecutionPlacementError
    ...
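The queue-matching rule itself can be sketched in plain Python; the function below models the behavior of dpctl.utils.get_execution_queue (the real function compares dpctl.SyclQueue objects):

```python
from typing import Optional, Sequence


def get_execution_queue(queues: Sequence) -> Optional[object]:
    # return the common queue when all entries compare equal,
    # and None otherwise
    if not queues:
        return None
    first = queues[0]
    return first if all(q == first for q in queues[1:]) else None


# a common queue on all inputs: execution placement is unambiguous
assert get_execution_queue(("q1", "q1")) == "q1"
# differing queues: no execution placement can be inferred
assert get_execution_queue(("q1", "q2")) is None
```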
To ensure that compute-follows-data works seamlessly out of the box, dpctl maintains a cache with (context, device) pairs as keys and queues as values, used by the dpctl.tensor.Device class.
>>> import dpctl
>>> from dpctl import tensor
>>> sycl_dev = dpctl.SyclDevice("cpu")
>>> d1 = tensor.Device.create_device(sycl_dev)
>>> d2 = tensor.Device.create_device("cpu")
>>> d3 = tensor.Device.create_device(dpctl.select_cpu_device())
>>> d1.sycl_queue == d2.sycl_queue, d1.sycl_queue == d3.sycl_queue, d2.sycl_queue == d3.sycl_queue
(True, True, True)
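The caching behavior seen above can be modeled in plain Python; QueueCache and the string keys are illustrative stand-ins for dpctl's internal cache, which is keyed by the actual SYCL context and device:

```python
class QueueCache:
    """Sketch of a (context, device) -> queue cache."""

    def __init__(self):
        self._cache = {}

    def get_queue(self, context, device):
        key = (context, device)
        if key not in self._cache:
            # stand-in for constructing a new queue for this pair
            self._cache[key] = ("queue", context, device)
        return self._cache[key]


cache = QueueCache()
q1 = cache.get_queue("ctx0", "cpu")
q2 = cache.get_queue("ctx0", "cpu")
assert q1 is q2  # the same queue object is reused for the same pair
```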
Since the dpctl.tensor.Device class is used by all array creation functions in dpctl.tensor, using the same value as the device keyword argument results in array instances that can be combined together in accordance with the compute-follows-data programming model.
>>> from dpctl import tensor
>>> import dpctl
>>> # queue for default-constructed device is used
>>> x1 = tensor.arange(100, dtype="int32")
>>> x2 = tensor.zeros(100, dtype="int32")
>>> x12 = tensor.concat((x1, x2))
>>> x12.sycl_queue == x1.sycl_queue, x12.sycl_queue == x2.sycl_queue
(True, True)
>>> # default constructors of the SyclQueue class create different instances of the queue
>>> q1 = dpctl.SyclQueue()
>>> q2 = dpctl.SyclQueue()
>>> q1 == q2
False
>>> y1 = tensor.arange(100, dtype="int32", sycl_queue=q1)
>>> y2 = tensor.zeros(100, dtype="int32", sycl_queue=q2)
>>> # this call raises ExecutionPlacementError since compute-follows-data
>>> # rules are not met
>>> tensor.concat((y1, y2))
Please refer to the array migration section of the introduction to dpctl.tensor for examples of how to resolve ExecutionPlacementError exceptions.