Streaming Data
Note
Scikit-learn patching functionality in daal4py was deprecated and moved to a separate package, Intel(R) Extension for Scikit-learn*. All future patches will be available only in Intel(R) Extension for Scikit-learn*. Use the scikit-learn-intelex package instead of daal4py for the scikit-learn acceleration.
For large quantities of data it might be impossible to provide all input data at once. The data might reside in multiple files, and merging them might be too costly or otherwise infeasible. In other cases the data is simply too large to be loaded completely into memory, or it arrives as an actual stream. daal4py’s streaming mode allows you to process such data.
Besides supporting certain use cases, streaming also allows interleaving I/O operations with computation.
Using daal4py’s streaming mode takes three steps:
1. When constructing the algorithm, configure it with streaming=True:
       algo = daal4py.svd(streaming=True)
2. Call compute(input-data) repeatedly with chunks of your input (arrays, DataFrames or files):
       for f in input_files:
           algo.compute(f)
3. When all input has been provided, call finalize() to obtain the result:
       result = algo.finalize()
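The three steps above can be put together as in the following minimal sketch. It assumes the input is an in-memory NumPy array that is split row-wise with np.array_split. The helper names run_streaming and svd_in_chunks are hypothetical and not part of daal4py. The daal4py import is deferred into the function so that the generic helper can be used without daal4py installed.

```python
import numpy as np


def run_streaming(algo, chunks):
    """Feed every chunk to a streaming algorithm object, then finalize it.

    `algo` is any object exposing compute(chunk) and finalize(),
    e.g. an algorithm constructed with streaming=True.
    """
    for chunk in chunks:
        algo.compute(chunk)   # step 2: process one chunk at a time
    return algo.finalize()    # step 3: obtain the final result


def svd_in_chunks(data, n_chunks=4):
    """Hypothetical helper: streaming SVD over row-wise chunks of an array."""
    import daal4py  # deferred import; requires daal4py to be installed

    algo = daal4py.svd(streaming=True)  # step 1: enable streaming mode
    return run_streaming(algo, np.array_split(data, n_chunks))
```

With daal4py installed, calling svd_in_chunks(np.random.rand(1000, 10)) should return the same kind of result object as a non-streaming SVD run, from which attributes such as singularValues can be read.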
The streaming algorithms also accept arrays and DataFrames as input, so the data can come from an actual stream rather than from multiple files. The example "SVD reading stream of data" simulates a data stream using a generator that reads a file in chunks.
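A chunk-reading generator of this kind might look like the sketch below. It is not the code of the referenced example. It assumes a purely numeric CSV file without a header row, and the names read_chunks and svd_from_file are hypothetical.

```python
import itertools

import numpy as np


def read_chunks(path, chunk_rows, delimiter=","):
    """Yield consecutive 2-D arrays of at most `chunk_rows` rows from a CSV file."""
    with open(path) as f:
        while True:
            # Pull the next block of lines without loading the whole file
            lines = list(itertools.islice(f, chunk_rows))
            if not lines:
                return  # end of file reached
            # ndmin=2 keeps a single-row chunk two-dimensional
            yield np.loadtxt(lines, delimiter=delimiter, ndmin=2)


def svd_from_file(path, chunk_rows=1000):
    """Hypothetical helper: streaming SVD over a file that is read in chunks."""
    import daal4py  # deferred import; requires daal4py to be installed

    algo = daal4py.svd(streaming=True)
    for chunk in read_chunks(path, chunk_rows):
        algo.compute(chunk)
    return algo.finalize()
```

Because read_chunks is a generator, only one chunk is held in memory at a time. The same generator could be replaced by any other source of arrays, such as a network stream.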
Supported Algorithms and Examples
The following algorithms support streaming:
SVD (svd)
Linear Regression Training (linear_regression_training)
Ridge Regression Training (ridge_regression_training)
Multinomial Naive Bayes Training (multinomial_naive_bayes_training)
Moments of Low Order (low_order_moments)
Covariance (covariance)
QR (qr)
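Training algorithms in this list take more than one input per chunk. The following sketch shows how streaming linear regression training might be driven. It assumes that compute accepts a feature chunk and the matching target chunk, and that the targets are a two-dimensional array of shape (n_rows, n_targets). The helper names paired_chunks and train_linear_regression_streaming are hypothetical.

```python
import numpy as np


def paired_chunks(X, y, n_chunks):
    """Split features and targets into row-aligned chunks."""
    return zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks))


def train_linear_regression_streaming(X, y, n_chunks=4):
    """Hypothetical helper: streaming linear regression training.

    Assumes y is 2-D with shape (n_rows, n_targets).
    """
    import daal4py  # deferred import; requires daal4py to be installed

    algo = daal4py.linear_regression_training(streaming=True)
    for X_chunk, y_chunk in paired_chunks(X, y, n_chunks):
        algo.compute(X_chunk, y_chunk)  # each call sees one aligned chunk pair
    return algo.finalize()              # training result for all chunks
```

Splitting X and y with the same n_chunks keeps the rows of each pair aligned, which matters because every compute call only sees its own chunk.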