Writing SYCL Kernels
Introduction
numba-dpex offers a way to program SYCL-supported devices using Python. Similar to SYCL's C++ programming model for heterogeneous computing, the extension offers Python abstractions for expressing data-parallelism using a hierarchical syntax. Note that not all SYCL concepts are currently supported in the extension, and some of the concepts may not be a good fit for Python.
The explicit kernel programming mode bears many similarities to Numba's other GPU backends: numba.cuda and numba.roc. Readers who are familiar with either of those backends, or with OpenCL, CUDA, or SYCL programming in general, should find writing kernels in numba-dpex very intuitive. Irrespective of the reader's level of familiarity with GPU programming frameworks, this documentation should serve as a guide for using the current features available in the extension.
Kernel declaration
A kernel function is a device function that is meant to be called from host code, where a device can be any SYCL-supported device such as a GPU, CPU, or FPGA. The present focus of development is mainly on Intel's GPU hardware. The main characteristics of a kernel function are:
kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array)
kernels explicitly declare their index space when called: i.e. the global size (the total number of work items) and the local size (the number of work items per work-group). Note that while a kernel is compiled once, it can be called multiple times with different global or local sizes. A minimal sketch illustrating both points follows this list.
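As a minimal illustration of both points, the sketch below (a hypothetical square kernel with made-up array names, using only the decorator, indexing function, and launch syntax shown elsewhere on this page) writes its result into an output array instead of returning it, and declares its index space at the call site:
import dpctl
import numpy as np

import numba_dpex as dppy


@dppy.kernel
def square(a, out):
    # A kernel cannot return a value, so each work item writes its
    # result into the ``out`` array passed in from the host.
    i = dppy.get_global_id(0)
    out[i] = a[i] * a[i]


a = np.arange(16, dtype=np.float32)
out = np.zeros_like(a)

with dpctl.device_context(dpctl.select_default_device()):
    # The index space is declared at launch time: 16 work items in total,
    # with the runtime choosing the work-group (local) size.
    square[a.size, dppy.DEFAULT_LOCAL_SIZE](a, out)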
Example
#! /usr/bin/env python
# Copyright 2020, 2021 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import dpctl
import numpy as np
import numpy.testing as testing

import numba_dpex as dppy


@dppy.kernel
def data_parallel_sum(a, b, c):
    """
    Vector addition using the ``kernel`` decorator.
    """
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]


def driver(a, b, c, global_size):
    print("A : ", a)
    print("B : ", b)
    data_parallel_sum[global_size, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
    print("A + B = ")
    print("C ", c)
    testing.assert_equal(c, a + b)


def main():
    global_size = 10
    N = global_size
    print("N", N)

    a = np.array(np.random.random(N), dtype=np.float32)
    b = np.array(np.random.random(N), dtype=np.float32)
    c = np.ones_like(a)

    # Use the environment variable SYCL_DEVICE_FILTER to change the default device.
    # See https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#sycl_device_filter.
    device = dpctl.select_default_device()
    print("Using device ...")
    device.print_device_info()

    with dpctl.device_context(device):
        driver(a, b, c, global_size)

    print("Done...")


if __name__ == "__main__":
    main()
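The example selects the default device, so which device actually executes the kernel is determined by the SYCL runtime (and can be influenced through the SYCL_DEVICE_FILTER environment variable noted in the code comment). If your dpctl version and installed runtimes support SYCL filter-selector strings, dpctl.device_context can also be given such a string to request a device explicitly; the selectors below are illustrative and reuse the driver function and arrays from the example above:
# Request a GPU exposed through the Level Zero backend (illustrative;
# assumes a Level Zero runtime and a compatible GPU are installed).
with dpctl.device_context("level_zero:gpu"):
    driver(a, b, c, global_size)

# Or run the same kernel on an OpenCL CPU device instead.
with dpctl.device_context("opencl:cpu"):
    driver(a, b, c, global_size)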
Kernel invocation
A kernel is typically launched in the following way:
def driver(a, b, c, global_size):
    print("A : ", a)
    print("B : ", b)
    data_parallel_sum[global_size, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
    print("A + B = ")
    print("C ", c)
    testing.assert_equal(c, a + b)
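The subscript ahead of the argument list sets the kernel's index space: the first entry is the global size (total number of work items) and the second is the local size. DEFAULT_LOCAL_SIZE lets the runtime choose a work-group size; a concrete local size can be passed instead, as in the sketch below, where the sizes are illustrative assumptions and the local size must evenly divide the global size:
import dpctl
import numpy as np

global_size = 1024
local_size = 32  # illustrative; must evenly divide global_size

a = np.random.random(global_size).astype(np.float32)
b = np.random.random(global_size).astype(np.float32)
c = np.zeros_like(a)

with dpctl.device_context(dpctl.select_default_device()):
    # Same kernel as above, now launched with an explicit work-group size.
    data_parallel_sum[global_size, local_size](a, b, c)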
Indexing functions
Currently, the following indexing functions are supported. They have the same semantics as in OpenCL.
numba_dpex.get_local_id
numba_dpex.get_local_size
numba_dpex.get_group_id
numba_dpex.get_num_groups
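As in OpenCL, a work item's global index can be reconstructed from its group index, the work-group size, and its local index. The hypothetical kernel below (made-up kernel and array names, chosen only to demonstrate the listed functions) records both the reported global id and the reconstructed one, which should agree:
import dpctl
import numpy as np

import numba_dpex as dppy


@dppy.kernel
def show_indexing(reported, reconstructed):
    # Position of this work item inside its work-group, and the group size.
    lid = dppy.get_local_id(0)
    lsize = dppy.get_local_size(0)
    # Index of the work-group itself (get_num_groups(0) similarly gives
    # the total number of work-groups in the launch).
    gid = dppy.get_group_id(0)

    i = dppy.get_global_id(0)
    reported[i] = i
    # OpenCL-style relationship: global id = group id * local size + local id.
    reconstructed[i] = gid * lsize + lid


reported = np.zeros(64, dtype=np.int32)
reconstructed = np.zeros(64, dtype=np.int32)

with dpctl.device_context(dpctl.select_default_device()):
    show_indexing[64, dppy.DEFAULT_LOCAL_SIZE](reported, reconstructed)

# After the launch, reported and reconstructed hold the same values.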