Scikit-Learn API and patching
The Python interface to the efficient Intel(R) oneAPI Data Analytics Library (oneDAL) provided by daal4py allows one to create scikit-learn compatible estimators, transformers, and clusterers powered by oneDAL that are nearly as efficient as native programs.
Deprecation Notice
The scikit-learn patching functionality in daal4py has been deprecated and moved to a separate package, Intel(R) Extension for Scikit-learn*. All future patches will be available only in Intel(R) Extension for Scikit-learn*. Please use the scikit-learn-intelex package instead of daal4py for scikit-learn acceleration.
oneDAL accelerated scikit-learn
daal4py can dynamically patch scikit-learn estimators to use Intel(R) oneAPI Data Analytics Library as the underlying solver, producing the same solution faster.
These patches can be enabled without editing the code of a scikit-learn application by using the following command-line flag:
python -m daal4py my_application.py
If you are using scikit-learn from the Intel® Distribution for Python, you can enable daal4py patches through an environment variable. To do this, set USE_DAAL4PY_SKLEARN to one of the values 'True', '1', 'y', 'yes', 'Y', 'YES', 'Yes', 'true', 'True', or 'TRUE', as shown below.
On Linux and Mac OS:
export USE_DAAL4PY_SKLEARN=1
On Windows:
set USE_DAAL4PY_SKLEARN=1
To disable daal4py patches, set the USE_DAAL4PY_SKLEARN
environment variable to 0.
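The accepted spellings of the flag can be captured in a small helper. This is an illustrative sketch (the function and set names are this document's invention, not part of daal4py):

```python
import os

# Truthy spellings of USE_DAAL4PY_SKLEARN listed in the docs above.
TRUTHY = {"True", "1", "y", "yes", "Y", "YES", "Yes", "true", "TRUE"}

def daal4py_patching_enabled(env=None):
    """Return True if USE_DAAL4PY_SKLEARN is set to an accepted value.

    Illustrative helper, not a daal4py API.
    """
    env = os.environ if env is None else env
    return env.get("USE_DAAL4PY_SKLEARN", "0") in TRUTHY

print(daal4py_patching_enabled({"USE_DAAL4PY_SKLEARN": "yes"}))  # True
print(daal4py_patching_enabled({"USE_DAAL4PY_SKLEARN": "0"}))    # False
```

Note that the check is exact-match: values such as 'yES' are not in the documented list and are treated as disabled here.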
Patches can also be enabled programmatically:
import daal4py.sklearn
daal4py.sklearn.patch_sklearn()
It is possible to undo the patch with:
daal4py.sklearn.unpatch_sklearn()
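A typical pattern is to patch before using scikit-learn estimators and unpatch afterwards. The sketch below is guarded with a try/except so it also runs in environments where daal4py is not installed:

```python
# Sketch: enable daal4py patches around scikit-learn usage, then restore.
try:
    from daal4py.sklearn import patch_sklearn, unpatch_sklearn
    patch_sklearn()      # subsequent sklearn estimator calls use oneDAL
    # ... fit and predict with sklearn estimators here ...
    unpatch_sklearn()    # restore the stock scikit-learn implementations
    backend = "daal4py"
except ImportError:
    backend = "stock sklearn"  # daal4py unavailable in this environment

print(backend)
```

Patching affects how the patched estimator classes resolve, so it should be applied before the estimators are constructed.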
Applying the monkey patch will impact the following existing scikit-learn algorithms:
| Task | Functionality | Parameters support | Data support |
| --- | --- | --- | --- |
| Classification | SVC | All parameters except | No limitations. |
| Classification | RandomForestClassifier | All parameters except | Multi-output, sparse data, and out-of-bag score are not supported. |
| Classification | KNeighborsClassifier | All parameters except | Multi-output and sparse data are not supported. |
| Classification | LogisticRegression | All parameters except | Only dense data is supported. |
| Regression | RandomForestRegressor | All parameters except | Multi-output, sparse data, and out-of-bag score are not supported. |
| Regression | KNeighborsRegressor | All parameters except | Multi-output and sparse data are not supported. |
| Regression | LinearRegression | All parameters except | Only dense data is supported; #observations must be >= #features. |
| Regression | Ridge | All parameters except | Only dense data is supported; #observations must be >= #features. |
| Regression | ElasticNet | All parameters except | Multi-output and sparse data are not supported; #observations must be >= #features. |
| Regression | Lasso | All parameters except | Multi-output and sparse data are not supported; #observations must be >= #features. |
| Clustering | KMeans | All parameters except | No limitations. |
| Clustering | DBSCAN | All parameters except | Only dense data is supported. |
| Dimensionality reduction | PCA | All parameters except | Sparse data is not supported. |
| Unsupervised | NearestNeighbors | All parameters except | Sparse data is not supported. |
| Other | train_test_split | All parameters are supported. | Only dense data is supported. |
| Other | assert_all_finite | All parameters are supported. | Only dense data is supported. |
| Other | pairwise_distance | With metric=`cosine` and | Only dense data is supported. |
| Other | roc_auc_score | Parameters | No limitations. |
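The "Data support" column amounts to a few simple preconditions. The sketch below mirrors some of them as a pre-flight check; the function and names are illustrative inventions of this document, not a daal4py API, and it does not cover every row of the table:

```python
# Illustrative check of oneDAL data-support limits from the table above.
DENSE_ONLY = {"LogisticRegression", "LinearRegression", "Ridge",
              "DBSCAN", "PCA", "NearestNeighbors"}
NEEDS_MORE_ROWS_THAN_COLS = {"LinearRegression", "Ridge",
                             "ElasticNet", "Lasso"}

def onedal_data_supported(algorithm, n_samples, n_features, is_sparse):
    """Return True if the data shape/format meets the table's limits."""
    if is_sparse and algorithm in DENSE_ONLY:
        return False  # these algorithms require dense input
    if algorithm in NEEDS_MORE_ROWS_THAN_COLS and n_samples < n_features:
        return False  # require #observations >= #features
    return True

print(onedal_data_supported("Ridge", n_samples=100, n_features=10, is_sparse=False))  # True
print(onedal_data_supported("Ridge", n_samples=5, n_features=10, is_sparse=False))    # False
```

When a limitation is hit at runtime, the patched estimators fall back to the stock scikit-learn implementation rather than failing.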
Monkey-patched scikit-learn classes and functions pass scikit-learn's own test suite, with a few exceptions specified in deselected_tests.yaml.
In particular, the tests execute check_estimator on all added and monkey-patched classes, which are discovered by means of introspection. This ensures scikit-learn API compatibility of all daal4py.sklearn classes.
Note
daal4py supports optimizations for the last four versions of scikit-learn. The latest release, daal4py 2021.1, supports scikit-learn 0.21.X, 0.22.X, 0.23.X, and 0.24.X.
scikit-learn verbose
To find out which implementation of the algorithm is currently used, set the IDP_SKLEARN_VERBOSE environment variable.
On Linux and Mac OS:
export IDP_SKLEARN_VERBOSE=INFO
On Windows:
set IDP_SKLEARN_VERBOSE=INFO
During the calls that use Intel-optimized scikit-learn, you will receive additional print statements that indicate which implementation is being called. These print statements are only available for scikit-learn algorithms with daal4py patches.
For example, for DBSCAN you get one of these print statements depending on which implementation is used:
INFO: sklearn.cluster.DBSCAN.fit: running accelerated version on CPU
INFO: sklearn.cluster.DBSCAN.fit: fallback to original Scikit-learn
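When capturing this output in logs, the two message shapes shown above can be classified mechanically. This is a sketch against the two example lines only; the regex and function are hypothetical, not part of daal4py:

```python
import re

# Classify IDP_SKLEARN_VERBOSE lines by backend. The two message formats
# are taken verbatim from the DBSCAN examples above.
PATTERN = re.compile(r"^INFO: (?P<call>[\w.]+): (?P<msg>.+)$")

def verbose_backend(line):
    """Return 'oneDAL', 'stock sklearn', or None for unrecognized lines."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return "oneDAL" if "accelerated" in m.group("msg") else "stock sklearn"

print(verbose_backend("INFO: sklearn.cluster.DBSCAN.fit: running accelerated version on CPU"))
print(verbose_backend("INFO: sklearn.cluster.DBSCAN.fit: fallback to original Scikit-learn"))
```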
scikit-learn API
The daal4py.sklearn package contains a scikit-learn compatible API that implements a subset of scikit-learn algorithms using Intel(R) oneAPI Data Analytics Library.
Currently, these include:
daal4py.sklearn.neighbors.KNeighborsClassifier
daal4py.sklearn.neighbors.KNeighborsRegressor
daal4py.sklearn.neighbors.NearestNeighbors
daal4py.sklearn.tree.DecisionTreeClassifier
daal4py.sklearn.ensemble.RandomForestClassifier
daal4py.sklearn.ensemble.RandomForestRegressor
daal4py.sklearn.ensemble.AdaBoostClassifier
daal4py.sklearn.cluster.KMeans
daal4py.sklearn.cluster.DBSCAN
daal4py.sklearn.decomposition.PCA
daal4py.sklearn.linear_model.Ridge
daal4py.sklearn.svm.SVC
daal4py.sklearn.linear_model.logistic_regression_path
daal4py.sklearn.linear_model.LogisticRegression
daal4py.sklearn.linear_model.ElasticNet
daal4py.sklearn.linear_model.Lasso
daal4py.sklearn.model_selection._daal_train_test_split
daal4py.sklearn.metrics._daal_roc_auc_score
These classes are always available, whether or not scikit-learn itself has been patched. For example:
import daal4py.sklearn
daal4py.sklearn.unpatch_sklearn()
import sklearn.datasets, sklearn.svm
digits = sklearn.datasets.load_digits()
X, y = digits.data, digits.target
clf_d = daal4py.sklearn.svm.SVC(kernel='rbf', gamma='scale', C=0.5).fit(X, y)
clf_v = sklearn.svm.SVC(kernel='rbf', gamma='scale', C=0.5).fit(X, y)
clf_d.score(X, y) # output: 0.9905397885364496
clf_v.score(X, y) # output: 0.9905397885364496