numba_dpex package
Subpackages
- numba_dpex.core package
- numba_dpex.dpctl_iface package
- numba_dpex.dpnp_iface package
- Submodules
- numba_dpex.dpnp_iface.dpnp_array_creations_impl module
- numba_dpex.dpnp_iface.dpnp_array_ops_impl module
- numba_dpex.dpnp_iface.dpnp_fptr_interface module
- numba_dpex.dpnp_iface.dpnp_indexing module
- numba_dpex.dpnp_iface.dpnp_linalgimpl module
- numba_dpex.dpnp_iface.dpnp_logic module
- numba_dpex.dpnp_iface.dpnp_manipulation module
- numba_dpex.dpnp_iface.dpnp_randomimpl module
- numba_dpex.dpnp_iface.dpnp_sort_search_countimpl module
- numba_dpex.dpnp_iface.dpnp_statisticsimpl module
- numba_dpex.dpnp_iface.dpnp_transcendentalsimpl module
- numba_dpex.dpnp_iface.dpnpdecl module
- numba_dpex.dpnp_iface.dpnpimpl module
- numba_dpex.dpnp_iface.stubs module
- Module contents
- Submodules
- numba_dpex.ocl package
- numba_dpex.utils package
Submodules
- numba_dpex.codegen module
- numba_dpex.compiler module
- numba_dpex.config module
- numba_dpex.decorators module
- numba_dpex.descriptor module
- numba_dpex.device_init module
- numba_dpex.dpctl_support module
- numba_dpex.dppy_array_type module
- numba_dpex.dppy_debuginfo module
- numba_dpex.dppy_offload_dispatcher module
- numba_dpex.dppy_parfor_diagnostics module
- numba_dpex.dppy_passbuilder module
- numba_dpex.extended_numba_itanium_mangler module
- numba_dpex.initialize module
- numba_dpex.interop module
- numba_dpex.numba_support module
- numba_dpex.numpy_usm_shared module
- numba_dpex.printimpl module
- numba_dpex.retarget module
- numba_dpex.spirv_generator module
- numba_dpex.target module
- numba_dpex.vectorizers module
Module contents
Module to interact with Intel-based devices
The extensions to Numba for Intel GPUs introduce two new features:
- A new backend with a new decorator, @dppy.kernel, that exposes an explicit kernel programming interface similar to the existing Numba GPU code-generation backends. The @dppy.kernel decorator currently implements a subset of OpenCL’s API through Numba’s intrinsic functions.
- A new auto-offload optimizer that automatically detects data-parallel code sections and offloads them onto a GPU or other OpenCL/SYCL devices. The auto-offload optimizer is enabled through Numba’s default @jit decorator.
Explicit Kernel Programming with New Decorators:
@dppy.kernel
The @dppy.kernel decorator (abbreviated DK below) can be used with or without extra arguments. Optionally, users can pass the signature of the arguments to the decorator. When a signature is provided, the generated OpenCL kernel is specialized for that type signature.
@dppy.kernel
def data_parallel_sum(a, b, c):
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]
To invoke the above function, users need to provide a global size (in the OpenCL sense), which is the size of a (the same as b and c), and a local size (dppy.DEFAULT_LOCAL_SIZE if the user does not want to specify one). See the complete example below.
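As a rough conceptual model of what launching such a kernel means (a pure-Python sketch that needs no device or numba_dpex; the function name is illustrative), invoking data_parallel_sum over a global size of N behaves like a loop in which each iteration plays the role of one work item:

```python
# Pure-Python model of data_parallel_sum[global_size, ...](a, b, c):
# each loop iteration stands in for one work item, and the loop index
# stands in for get_global_id(0).
def data_parallel_sum_host(a, b, c, global_size):
    for i in range(global_size):   # one iteration == one work item
        c[i] = a[i] + b[i]

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.0] * 3
data_parallel_sum_host(a, b, c, global_size=3)
print(c)  # [11.0, 22.0, 33.0]
```

On a real device the iterations run concurrently across work items rather than sequentially.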
@dppy.func
The @dppy.func decorator is the other decorator provided in the explicit kernel programming model. This decorator allows users to write “device” functions that can be invoked from inside DK functions but cannot be invoked from the host. The decorator also supports type specialization as with the DK decorator. Functions decorated with @dppy.func will also be JIT compiled and inlined into the OpenCL Program containing the @dppy.kernel function calling it. A @dppy.func will not be launched as an OpenCL kernel.
@dppy.func
def bar(a):
    return a * a

@dppy.kernel
def foo(in_arr, out_arr):  # "in" is a reserved word in Python, so the
                           # parameters are named in_arr and out_arr
    i = dppy.get_global_id(0)
    out_arr[i] = bar(in_arr[i])
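Since bar is inlined into the calling kernel, the pair above computes an elementwise square. A pure-Python sketch of the same computation (no device required; the host-loop again models the work items):

```python
# bar is a plain function; foo_host models the kernel that calls it,
# with the loop index standing in for get_global_id(0).
def bar(a):
    return a * a

def foo_host(in_arr, out_arr, global_size):
    for i in range(global_size):   # one iteration == one work item
        out_arr[i] = bar(in_arr[i])

out = [0] * 4
foo_host([1, 2, 3, 4], out, global_size=4)
print(out)  # [1, 4, 9, 16]
```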
Intrinsic Functions:
The intrinsic functions listed below can be used directly inside a DK function. Each is equivalent to the similarly named OpenCL function; wherever the numba-dpex implementation differs from the OpenCL version, the difference is noted. Note that these functions cannot be used anywhere outside of a DK function in a Numba application. Readers are referred to the OpenCL API specification for the full functionality of each function.
Each of the following numba-dpex intrinsics is equivalent to the OpenCL function of the same name:

- get_global_id
- get_local_id
- get_global_size
- get_local_size
- get_group_id
- get_num_groups
- get_work_dim
- barrier
- mem_fence
- sub_group_barrier (does not take any argument and is equivalent to calling barrier with the CLK_LOCAL_MEM_FENCE argument)
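The work-item query intrinsics are related to one another by the standard OpenCL index-space identities. A pure-Python illustration for a one-dimensional index space (no device required):

```python
# For a 1-D index space:
#   get_global_id(0)  == get_group_id(0) * get_local_size(0) + get_local_id(0)
#   get_num_groups(0) == get_global_size(0) // get_local_size(0)
global_size = 8   # models get_global_size(0)
local_size = 4    # models get_local_size(0)

for global_id in range(global_size):
    group_id = global_id // local_size   # models get_group_id(0)
    local_id = global_id % local_size    # models get_local_id(0)
    assert global_id == group_id * local_size + local_id

num_groups = global_size // local_size   # models get_num_groups(0)
print(num_groups)  # 2
```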
Other Intrinsic Functions
The explicit kernel programming feature provides some additional helper/intrinsic functions that do not have a one-to-one mapping with OpenCL API functions. All such currently supported functions are listed below. As with the other intrinsic functions, these can only be used inside a DK-decorated function.
- print(varargs): a subset of the OpenCL printf function; the numba-dpex version of print supports only int, string, and float arguments.
- local.array(shape, dtype): allows users to create local memory that is accessible only to the work items in a work group. Required arguments: shape, an integer or a tuple of integers; dtype, an integer, float, or Numba-supported NumPy dtype.
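A typical use of local memory is as a shared scratch buffer for a work-group reduction. A conceptual pure-Python model of that pattern (no device required; the function name is illustrative):

```python
# local_memory models local.array(shape=local_size, dtype=float): one
# buffer shared by every work item in the work group.
def workgroup_sum(values, local_size):
    local_memory = [0.0] * local_size
    for local_id in range(local_size):       # one iteration == one work item
        local_memory[local_id] = values[local_id]
    # On a real device a barrier() is required here, so that every work
    # item's store is visible before the buffer is read.
    return sum(local_memory)

print(workgroup_sum([1.0, 2.0, 3.0, 4.0], local_size=4))  # 10.0
```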
- atomic.add(addr, value): performs an atomicrmw (atomic read-modify-write) operation on the operand addr using the operand value. Note that the atomic.add operation only supports integer data types.
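The guarantee atomic.add provides is that the read-modify-write on addr cannot be interrupted by another work item. A pure-Python sketch of that semantics, modeling the atomicity with a lock (threading is only a stand-in for concurrent work items):

```python
import threading

counter = [0]              # models the "addr" operand (integer only)
lock = threading.Lock()

def atomic_add(addr, value):
    with lock:             # models the hardware atomicity guarantee
        old = addr[0]
        addr[0] = old + value
        return old         # atomic ops conventionally return the old value

# 100 "work items" each add 1; without atomicity some updates could be lost.
threads = [threading.Thread(target=atomic_add, args=(counter, 1))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 100
```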
Complete Example using @dppy.kernel:
import numpy as np
import numba_dpex as dppy
import dpctl

@dppy.kernel
def data_parallel_sum(a, b, c):
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]

def driver(device_env, a, b, c, global_size):
    # Copy the data to the device
    dA = device_env.copy_array_to_device(a)
    dB = device_env.copy_array_to_device(b)
    dC = device_env.create_device_array(c)
    print("before : ", dA._ndarray)
    print("before : ", dB._ndarray)
    print("before : ", dC._ndarray)
    data_parallel_sum[global_size, dppy.DEFAULT_LOCAL_SIZE](dA, dB, dC)
    device_env.copy_array_from_device(dC)
    print("after : ", dC._ndarray)

def main():
    global_size = 10
    N = global_size
    print("N", N)

    a = np.array(np.random.random(N), dtype=np.float32)
    b = np.array(np.random.random(N), dtype=np.float32)
    c = np.ones_like(a)

    if dpctl.has_gpu_queues():
        with dpctl.device_context("opencl:gpu") as gpu_queue:
            driver(gpu_queue, a, b, c, global_size)
    elif dpctl.has_cpu_queues():
        with dpctl.device_context("opencl:cpu") as cpu_queue:
            driver(cpu_queue, a, b, c, global_size)
    else:
        print("No device found")
        exit()

    print("Done...")

if __name__ == '__main__':
    main()
Automatic Offloading of Data-Parallel Regions:
We propose adding to Numba a new backend providing an automatic offload optimizer for @jit-decorated functions. Our current proposal only considers SYCL devices. Other types of devices, e.g. CUDA, may be considered at a later point.
Complete Example with automatic offloading:
from numba import njit
import numpy as np

@njit(parallel={'offload': True})
def f1(a, b):
    c = a + b
    return c

def main():
    global_size = 64
    local_size = 32
    N = global_size * local_size
    print("N", N)

    a = np.ones(N, dtype=np.float32)
    b = np.ones(N, dtype=np.float32)

    c = f1(a, b)
    for i in range(N):
        if c[i] != 2.0:
            print("First index not equal to 2.0 was", i)
            break

if __name__ == '__main__':
    main()
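The offload decorator only changes where the work runs, not what f1 computes. A plain-Python sketch of the same elementwise addition and the result check (stdlib only, no Numba required):

```python
# f1 computes c = a + b elementwise; with both inputs all ones,
# every element of c should be exactly 2.0.
N = 8
a = [1.0] * N
b = [1.0] * N
c = [x + y for x, y in zip(a, b)]
assert all(v == 2.0 for v in c)
print(c[0])  # 2.0
```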
Supported NumPy Functions:
- Trigonometric — supported: sin, cos, tan, arcsin, arccos, arctan, arctan2, sinh, cosh, tanh, arcsinh, arccosh, arctanh, deg2rad, rad2deg, degrees, radians; not supported: hypot
- Floating Point Ops — supported: isfinite, isinf, isnan, floor, ceil, trunc; not supported: signbit, copysign, nextafter, ldexp, spacing
- Math — supported: add, subtract, multiply, divide, true_divide, negative, power, remainder, mod, fmod, abs, absolute, fabs, sign, conj, exp, log, log10, expm1, log1p, sqrt, square, reciprocal, conjugate; not supported: logaddexp, logaddexp2, floor_divide, rint, exp2, log2, gcd, lcm
- Comparison — supported: greater, greater_equal, less, less_equal, not_equal, equal, logical_and, logical_or, logical_xor, logical_not, maximum, minimum, fmax, fmin
- Bitwise Op — supported: bitwise_and, bitwise_or, bitwise_xor, bitwise_not, invert, left_shift, right_shift
- Linear Algebra — none currently supported: dot, kron, outer, trace, vdot, cholesky, cond, det, eig, eigh, eigvals, eigvalsh, inv, lstsq, matrix_power, matrix_rank, norm, pinv, qr, slogdet, solve, svd
- Reduction — none currently supported: diff, median, nancumprod, nancumsum, nanmax, nanmean, nanmedian, nanmin, nanpercentile, nanquantile, nanprod, nanstd, nansum, nanvar, percentile, quantile
- numba_dpex.offload_to_sycl_device()