Writing SYCL Kernels
====================

Introduction
-------------

Numba-dpex offers a way to write data-parallel kernels directly using Python.
The compiler extension to Numba has a programming model similar to the SYCL C++
domain-specific language. By providing similar abstractions as SYCL, Python
programmers can use the compiler to express data-parallelism using a
hierarchical syntax. Note that not all SYCL concepts are currently supported by
numba-dpex.

The explicit kernel programming mode of numba-dpex bears similarities with
Numba's other GPU backends: ``numba.cuda`` and ``numba.roc``. The documentation
should serves as a guide for using the current kernel programming features
available in numba-dpex.

Kernel declaration
------------------
A kernel function is a device function that is meant to be called from host
code, where a device can be any SYCL supported device such as a GPU, CPU, or an
FPGA. The present focus of development of numba-dpex is mainly on Intel's
GPU hardware. The main characteristics of a kernel function are:

- kernels cannot explicitly return a value; all result data must be written to
  an array passed to the function (if computing a scalar, you will probably pass
  a one-element array)
- kernels explicitly declare their thread hierarchy when called: i.e. the number
  of thread blocks and the number of threads per block (note that while a kernel
  is compiled once, it can be called multiple times with different block sizes
  or grid sizes).

Example
~~~~~~~~~

.. literalinclude:: ../../../numba_dpex/examples/sum.py

Kernel invocation
------------------

A kernel is typically launched in the following way:

.. literalinclude:: ../../../numba_dpex/examples/sum.py
   :pyobject: driver

Indexing functions
------------------

Numba-dpex provides the following indexing functions that have OpenCL-like
semantics:

- ``numba_dpex.get_local_id``
- ``numba_dpex.get_local_size``
- ``numba_dpex.get_group_id``
- ``numba_dpex.get_num_groups``