dask 2021.10.0

from_array(x, chunks='auto', name=None, lock=False, asarray=None, fancy=True, getitem=None, meta=None, inline_array=False)

Create a dask array from something that looks like an array.

Input must have a .shape, .ndim, .dtype and support numpy-style slicing.

Parameters

x : array_like
chunks : int, tuple

How to chunk the array. Must be one of the following forms:

  • A blocksize like 1000.

  • A blockshape like (1000, 1000).

  • Explicit sizes of all blocks along all dimensions like ((1000, 1000, 500), (400, 400)).

  • A size in bytes, like "100 MiB", which will choose a uniform block-like shape.

  • The word "auto", which acts like the above, but uses the configuration value array.chunk-size for the chunk size.

Using -1 or None as a blocksize indicates the full size of the corresponding dimension, as illustrated below.
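As a sketch of the accepted forms, using a plain NumPy array (any array-like behaves the same way):

.. code-block:: python
    >>> import numpy as np
    >>> import dask.array as da
    >>> x = np.ones((2000, 800))
    >>> da.from_array(x, chunks=1000).chunks                   # blocksize
    ((1000, 1000), (800,))
    >>> da.from_array(x, chunks=(1000, 400)).chunks            # blockshape
    ((1000, 1000), (400, 400))
    >>> da.from_array(x, chunks=((1500, 500), (800,))).chunks  # explicit sizes
    ((1500, 500), (800,))
    >>> da.from_array(x, chunks=(-1, 400)).chunks              # -1 spans the dimension
    ((2000,), (400, 400))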

name : str or bool, optional

The key name to use for the array. Defaults to a hash of x.

Hashing is useful if the same value of x is used to create multiple arrays, as Dask can then recognise that they're the same and avoid duplicate computations. However, it can also be slow, and if the array is not contiguous it is copied for hashing. If the array uses stride tricks (such as numpy.broadcast_to or skimage.util.view_as_windows) to have a larger logical size than physical size, this copy can cause excessive memory usage.

If you don't need the deduplication provided by hashing, use name=False to generate a random name instead of hashing, which avoids the pitfalls described above. Using name=True is equivalent to the default.

By default, hashing uses Python's standard sha1. This behaviour can be changed by installing cityhash, xxhash or murmurhash; if one of these is installed, the tokenisation step can be sped up by a large factor.

.. note::
    Because this name is used as the key in task graphs, you should ensure that it uniquely identifies the data contained within. If you'd like to provide a descriptive name that is still unique, combine the descriptive name with :func:`dask.base.tokenize` of the array_like. See :ref:`graphs` for more.
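As a sketch of the difference (assumed behaviour, not verified against this exact version): hashed names are deterministic, while name=False draws a fresh random name each time.

.. code-block:: python
    >>> import numpy as np
    >>> import dask.array as da
    >>> x = np.arange(1000000)
    >>> da.from_array(x).name == da.from_array(x).name  # hashed: same key
    True
    >>> da.from_array(x, name=False).name == da.from_array(x, name=False).name
    False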

lock : bool or Lock, optional

If x doesn't support concurrent reads then provide a lock here, or pass in True to have dask.array create one for you.

asarray : bool, optional

If True then call np.asarray on chunks to convert them to numpy arrays. If False then chunks are passed through unchanged. If None (default) then we use True if the __array_function__ method is undefined.
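For example, numpy.ma.MaskedArray defines __array_function__, so by default its chunks are passed through unchanged and the mask survives, whereas asarray=True converts each chunk with np.asarray and drops the mask. A sketch of this assumed behaviour:

.. code-block:: python
    >>> import numpy as np
    >>> import dask.array as da
    >>> m = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
    >>> type(da.from_array(m, chunks=3).compute())  # doctest: +SKIP
    <class 'numpy.ma.core.MaskedArray'>
    >>> type(da.from_array(m, chunks=3, asarray=True).compute())  # doctest: +SKIP
    <class 'numpy.ndarray'>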

fancy : bool, optional

If x doesn't support fancy indexing (e.g. indexing with lists or arrays) then set to False. Default is True.

meta : Array-like, optional

The metadata for the resulting dask array. This is the kind of array that will result from slicing the input array. Defaults to the input array.
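A minimal sketch, using a masked array purely for illustration: passing an empty array of the desired type as meta determines what kind of array the chunks are treated as (_meta is the internal attribute where this is stored):

.. code-block:: python
    >>> import numpy as np
    >>> import dask.array as da
    >>> a = da.from_array(np.arange(10), chunks=5, meta=np.ma.masked_array([]))
    >>> type(a._meta)  # doctest: +SKIP
    <class 'numpy.ma.core.MaskedArray'>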

inline_array : bool, default False

How to include the array in the task graph. By default (inline_array=False) the array is included in a task by itself, and each chunk refers to that task by its key.

.. code-block:: python
    >>> x = h5py.File("data.h5")["/x"]  # doctest: +SKIP
    >>> a = da.from_array(x, chunks=500)  # doctest: +SKIP
    >>> dict(a.dask)  # doctest: +SKIP
    {
       'array-original-<name>': <HDF5 dataset ...>,
       ('array-<name>', 0): (getitem, "array-original-<name>", ...),
       ('array-<name>', 1): (getitem, "array-original-<name>", ...)
    }

With inline_array=True, Dask will instead inline the array directly in the values of the task graph.

.. code-block:: python
    >>> a = da.from_array(x, chunks=500, inline_array=True)  # doctest: +SKIP
    >>> dict(a.dask)  # doctest: +SKIP
    {
       ('array-<name>', 0): (getitem, <HDF5 dataset ...>, ...),
       ('array-<name>', 1): (getitem, <HDF5 dataset ...>, ...)
    }

Note that there's no key in the task graph with just the array x anymore. Instead it's placed directly in the values.

The right choice for inline_array depends on several factors, including the size of x, how expensive it is to create, which scheduler you're using, and the pattern of downstream computations. As a heuristic, inline_array=True may be the right choice when the array x is cheap to serialize and deserialize (since it's included in the graph many times) and if you're experiencing ordering issues (see :ref:`order` for more).

This has no effect when x is a NumPy array.


Examples

>>> x = h5py.File('...')['/data/path']  # doctest: +SKIP
>>> a = da.from_array(x, chunks=(1000, 1000))  # doctest: +SKIP

If your underlying datastore does not support concurrent reads then include the lock=True keyword argument or lock=mylock if you want multiple arrays to coordinate around the same lock.

>>> a = da.from_array(x, chunks=(1000, 1000), lock=True)  # doctest: +SKIP

If your underlying datastore has a .chunks attribute (as h5py and zarr datasets do) then a multiple of that chunk shape will be used if you do not provide a chunk shape.

>>> a = da.from_array(x, chunks='auto')  # doctest: +SKIP
>>> a = da.from_array(x, chunks='100 MiB')  # doctest: +SKIP
>>> a = da.from_array(x)  # doctest: +SKIP

If providing a name, ensure that it is unique

>>> import dask.base
>>> token = dask.base.tokenize(x)  # doctest: +SKIP
>>> a = da.from_array(x, name='myarray-' + token)  # doctest: +SKIP

NumPy ndarrays are eagerly sliced and then embedded in the graph.

>>> import numpy as np
>>> import dask.array
>>> a = dask.array.from_array(np.array([[1, 2], [3, 4]]), chunks=(1, 1))
>>> a.dask[a.name, 0, 0][0]
array([1])

Chunks with exactly-specified, different sizes can be created.

>>> import numpy as np
>>> import dask.array as da
>>> x = np.random.random((100, 6))
>>> a = da.from_array(x, chunks=((67, 33), (6,)))




File: /dask/array/core.py#3025