Streaming Data

Note

Scikit-learn patching functionality in daal4py was deprecated and moved to a separate package, Intel(R) Extension for Scikit-learn*. All future patches will be available only in Intel(R) Extension for Scikit-learn*. Use the scikit-learn-intelex package instead of daal4py for the scikit-learn acceleration.

For large quantities of data, it might be impossible to provide all input data at once. The data might reside in multiple files that are too costly (or otherwise infeasible) to merge. The data might be too large to load completely into memory. Or the data might arrive as an actual stream. daal4py’s streaming mode allows you to process such data.

Beyond enabling these use cases, streaming also lets you interleave I/O operations with computation.

Using daal4py’s streaming mode takes three steps:

  1. When constructing the algorithm, configure it with streaming=True:

    algo = daal4py.svd(streaming=True)
    
  2. Call compute(input-data) repeatedly, once for each chunk of your input (arrays, DataFrames or files):

    for f in input_files:
        algo.compute(f)
    
  3. After the last chunk has been passed in, call finalize() to obtain the result:

    result = algo.finalize()
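
The three steps above can be combined into one small function. The following is a minimal sketch, not the official example: iter_row_chunks and streaming_svd are hypothetical helper names introduced here for illustration, and the sketch assumes that the input is a 2-D NumPy array already held in memory, which is split into row blocks only to imitate chunked input.

```python
import numpy as np


def iter_row_chunks(data, chunk_rows):
    """Yield consecutive row blocks of a 2-D array (hypothetical helper)."""
    for start in range(0, data.shape[0], chunk_rows):
        yield data[start:start + chunk_rows]


def streaming_svd(chunks):
    """Feed every chunk to a streaming SVD and return the finalized result."""
    # Imported inside the function so that the chunking helper above
    # can be used and tested without daal4py being installed.
    import daal4py

    algo = daal4py.svd(streaming=True)   # step 1: enable streaming mode
    for chunk in chunks:                 # step 2: one compute() call per chunk
        algo.compute(chunk)
    return algo.finalize()               # step 3: obtain the final result
```

With daal4py installed, a call such as streaming_svd(iter_row_chunks(data, 1000)) should return a result object from which fields such as singularValues can be read. The chunk size of 1000 rows is an arbitrary choice for illustration.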
    

The streaming algorithms also accept arrays and DataFrames as input, so the data can come from an actual stream rather than from multiple files. The example "SVD reading stream of data" simulates a data stream with a generator that reads a file in chunks.
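
A generator of this kind might look like the sketch below. It is an illustration written for this page rather than the code of the linked example: read_csv_chunks is a hypothetical name, and the sketch assumes a purely numeric, comma-separated file without a header row. The file is opened once and read a fixed number of lines at a time, so only one chunk is held in memory at any moment.

```python
from itertools import islice

import numpy as np


def read_csv_chunks(path, chunk_rows, delimiter=","):
    """Yield a numeric CSV file as 2-D arrays of at most chunk_rows rows."""
    with open(path) as f:
        while True:
            # Pull the next chunk_rows lines without reading the whole file.
            raw = list(islice(f, chunk_rows))
            if not raw:
                return  # end of file reached
            # Skip blank lines, for example a trailing newline at the end.
            rows = [line for line in raw if line.strip()]
            if rows:
                # ndmin=2 keeps a single remaining row two-dimensional.
                yield np.loadtxt(rows, delimiter=delimiter, ndmin=2)
```

Each array yielded by the generator can be passed to compute() of an algorithm constructed with streaming=True, followed by a single finalize() call once the generator is exhausted.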

Supported Algorithms and Examples

The following algorithms support streaming: