Heterogeneous Systems and Programming Concepts

This section introduces the basic concepts defined by the SYCL standard for programming heterogeneous systems and used by dpctl.

Note

For SYCL-level details, refer to a more topical SYCL reference, such as the SYCL 2020 spec.

Definitions

  • Heterogeneous computing

    Refers to computing on multiple devices in a program.

  • Host

    Every program starts by running on a host, and most of the lines of code in a program, in particular the lines of code implementing the Python interpreter itself, are usually executed on the host. The host is customarily a CPU.

  • Device

    A device is a processing unit connected to a host that is programmable with a specific device driver. Different types of devices (CPUs, GPUs, FPGAs, ASICs, DSPs) can have different architectures, but all are programmable using the same oneAPI programming model.

  • Platform

    A platform is an abstraction representing a collection of devices addressable by the same lower-level framework. As multiple devices of the same type can be programmed by the same framework, a platform may contain multiple devices. The same physical hardware (for example, a GPU) may be programmable by different lower-level frameworks, and hence be enumerated as part of different platforms. For example, the same GPU hardware can be listed as an OpenCL* GPU device and a Level-Zero* GPU device.

  • Context

    Holds the runtime information needed to operate on a device or a group of devices from the same platform. Contexts are relatively expensive to create and should be reused as much as possible.

  • Queue

    A queue is needed to schedule the execution of any computation or data copying on a device. Queue construction requires specifying a device and a context targeting that device, as well as additional properties, such as whether profiling information should be collected or whether submitted tasks should be executed in the order in which they were submitted.

  • Event

    An event holds information related to a computation or data-movement operation scheduled for execution on a queue, such as its execution status, as well as profiling information if the queue the task was submitted to allowed such information to be collected. Events can be used to specify task dependencies as well as to synchronize the host and devices.

  • Unified Shared Memory

    Unified Shared Memory (USM) refers to pointer-based device memory management. USM allocations are bound to a context: a pointer representing a USM allocation can be unambiguously mapped to the data it represents only if the associated context is known. USM allocations are accessible by computational kernels executed on a device, provided that the allocation is bound to the same context that was used to construct the queue where the kernel is scheduled for execution.

    Depending on the capability of the device, USM allocations can be of the following kinds (see also the sketch following this list):

    Name                 Host accessible   Description
    -------------------  ----------------  ------------------------------------------------------------
    Device allocation    No                Allocation in device memory; not accessible from the host.
    Shared allocation    Yes               Accessible by both the host and device.
    Host allocation      Yes               Allocation in host memory that is accessible from a device.

    The runtime manages synchronization of the host's and device's views into shared allocations. The initial placement of shared allocations is not defined.

  • Backend

    Refers to an implementation of the oneAPI programming model using a lower-level heterogeneous programming API. Examples of backends include "cuda", "hip", "level_zero", and "opencl". In particular, a backend implements the platform abstraction.
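
To make the definitions above concrete, the following sketch enumerates the available devices and queries which kinds of USM allocation each supports. It relies on dpctl.get_devices() and the has_aspect_usm_* properties of dpctl.SyclDevice; the output is system-dependent and therefore omitted.

>>> import dpctl
>>> for d in dpctl.get_devices():
...     print(
...         d.name,
...         d.has_aspect_usm_device_allocations,
...         d.has_aspect_usm_shared_allocations,
...         d.has_aspect_usm_host_allocations,
...     )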

Platform

A platform abstracts one or more SYCL devices that are connected to a host and can be programmed by the same underlying framework.

The dpctl.SyclPlatform class represents a platform and abstracts the sycl::platform SYCL runtime class.

To obtain all platforms available on a system programmatically, use the dpctl.lsplatform() function. Refer to Enumerating available devices for more information.

It is possible to select devices from a specific backend, and hence belonging to the same platform, by using the ONEAPI_DEVICE_SELECTOR environment variable or by using a filter selector string.
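
For illustration, a platform listing and a device selection via a filter selector string might look as follows; the filter "opencl:cpu:0" is a hypothetical choice and must correspond to a device actually present on the system:

>>> import dpctl
>>> dpctl.lsplatform()                 # prints a summary of available platforms
>>> # filter selector strings have the form backend:device_type:relative_id;
>>> # "opencl:cpu:0" is a hypothetical example and may not exist on every system
>>> dev = dpctl.SyclDevice("opencl:cpu:0")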

Context

A context is an entity associated with the state of a device as managed by the backend. A context is required to unambiguously map a unified address space pointer to the device on which it was allocated.

In order for two DPC++-based Python extensions to share USM allocations, e.g. as part of DLPack exchange, each extension must use the same SYCL context when submitting for execution the programs that access this allocation.

Since sycl::context is dynamically constructed by each extension, sharing a USM allocation in general requires sharing the sycl::context along with the USM pointer, as is done in the __sycl_usm_array_interface__ attribute.

Since DLPack itself does not provide for storing the sycl::context, the dpctl.tensor.from_dlpack() function is only supported for devices of those platforms that implement the default platform context SYCL extension sycl_ext_oneapi_default_platform_context, and only for allocations that are bound to this default context.

To query whether a particular device dev belongs to a platform that implements the default context, check whether dev.sycl_platform.default_context returns an instance of dpctl.SyclContext or raises an exception.
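
A minimal sketch of such a check, assuming a default-selected device is available:

>>> import dpctl
>>> dev = dpctl.SyclDevice()           # default-selected device
>>> try:
...     ctx = dev.sycl_platform.default_context
...     print(isinstance(ctx, dpctl.SyclContext))
... except Exception:
...     print("platform does not implement the default context")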

Queue

A SYCL queue is an entity associated with scheduling computational tasks for execution on a targeted SYCL device, using a specific SYCL context.

The queue constructor generally requires both to be specified. For platforms that support the default platform context, a shortcut queue constructor call that specifies only a device uses the default platform context associated with the platform the given device is a part of.

Queues constructed from a device instance, or from a filter string that selects it, share the same context:
>>> import dpctl
>>> d = dpctl.SyclDevice("gpu")
>>> q1 = dpctl.SyclQueue(d)
>>> q2 = dpctl.SyclQueue("gpu")
>>> q1.sycl_context == q2.sycl_context, q1.sycl_device == q2.sycl_device
(True, True)
>>> q1 == q2
False

Even though q1 and q2, both instances of dpctl.SyclQueue, target the same device and use the same context, they do not compare equal, since they correspond to two independent scheduling entities.

Note

dpctl.tensor.usm_ndarray objects, one associated with q1 and the other with q2, cannot be combined in a call to the same dpctl.tensor function, since dpctl.tensor implements the compute-follows-data programming model.
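
As an illustrative sketch, continuing with q1 and q2 from above and assuming that such a mismatch raises dpctl.utils.ExecutionPlacementError:

>>> import dpctl.tensor as dpt
>>> x1 = dpt.ones(5, sycl_queue=q1)
>>> x2 = dpt.ones(5, sycl_queue=q2)
>>> # x1 and x2 are associated with different queues, so the execution
>>> # placement is ambiguous and this call is expected to raise
>>> dpt.add(x1, x2)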

Event

A SYCL event is an entity created when a task is submitted to a SYCL queue for execution. Events are used by the DPC++ runtime to order the execution of computational tasks. They may also contain profiling information associated with the submitted task, provided the queue was created with the "enable_profiling" property.

A SYCL event can be used to synchronize the execution of the associated task with execution on the host by using dpctl.SyclEvent.wait().

Methods dpctl.SyclQueue.submit_async() and dpctl.SyclQueue.memcpy_async() return dpctl.SyclEvent instances.
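
A hedged sketch of this workflow is shown below; it assumes that memcpy_async accepts (dest, src, nbytes) arguments and that the event exposes profiling_info_start and profiling_info_end counters (in nanoseconds) when profiling is enabled:

>>> import dpctl
>>> import dpctl.memory as dpm
>>> q = dpctl.SyclQueue(property="enable_profiling")
>>> src = dpm.MemoryUSMHost(1024, queue=q)
>>> dst = dpm.MemoryUSMDevice(1024, queue=q)
>>> e = q.memcpy_async(dst, src, src.nbytes)   # returns a dpctl.SyclEvent
>>> e.wait()                                   # synchronize the host with the copy
>>> elapsed_ns = e.profiling_info_end - e.profiling_info_start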

Note

At this point, dpctl.tensor does not provide public API for accessing SYCL events associated with submission of computation tasks implementing operations on dpctl.tensor.usm_ndarray objects.

Unified Shared Memory

Unified Shared Memory allocations of each kind are represented through Python classes dpctl.memory.MemoryUSMDevice, dpctl.memory.MemoryUSMShared, and dpctl.memory.MemoryUSMHost.

These class constructors allocate USM memory of the requested size in bytes on the device targeted by the given SYCL queue; the allocation is bound to the context from that queue. The queue argument is stored in the instance of the class and is used to submit tasks when copying elements from or to this allocation, or when filling the allocation with values.

Classes that represent host-accessible USM allocations, i.e. the USM-shared and USM-host types, expose the Python buffer interface.

>>> import dpctl.memory as dpm
>>> import numpy as np

>>> # allocate 26 bytes of USM-device memory and copy in the alphabet
>>> mem_d = dpm.MemoryUSMDevice(26)
>>> mem_d.copy_from_host(b"abcdefghijklmnopqrstuvwxyz")

>>> mem_s = dpm.MemoryUSMShared(30)
>>> mem_s.memset(value=ord(b"-"))
>>> mem_s.copy_from_device(mem_d)

>>> # since USM-shared is host-accessible,
>>> # it implements Python buffer protocol that allows
>>> # for Python objects to read this USM allocation
>>> bytes(mem_s)
b'abcdefghijklmnopqrstuvwxyz----'
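
To show the binding to a specific queue described above, a short sketch, assuming the constructors accept a queue keyword argument:

>>> import dpctl
>>> import dpctl.memory as dpm
>>> q = dpctl.SyclQueue()                # default-constructed queue
>>> # the allocation is bound to the context of q, and q is used for
>>> # subsequent copy and fill operations on this allocation
>>> m = dpm.MemoryUSMDevice(64, queue=q)
>>> m.nbytes
64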

Backend

Intel(R) oneAPI Data Parallel C++ compiler ships with two backends:

  1. OpenCL backend

  2. Level-Zero backend

Additional backends can be added to the compiler by installing CodePlay’s plugins:

  1. CUDA backend: provided by oneAPI for NVIDIA(R) GPUs from CodePlay

  2. HIP backend: provided by oneAPI for AMD GPUs from CodePlay

When building the open source Intel LLVM compiler from source, the project can be configured to enable different backends (see the Get Started Guide for further details).
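
As an illustrative sketch, the backend exposing each enumerated device can be inspected from Python; the resulting set depends on the installed backends:

>>> import dpctl
>>> # elements are dpctl.backend_type enum values, e.g. backend_type.opencl
>>> # or backend_type.level_zero, depending on the system
>>> {d.backend for d in dpctl.get_devices()}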