from_array(x, chunks='auto', name=None, lock=False, asarray=None, fancy=True, getitem=None, meta=None, inline_array=False)
Input must have a .shape, .ndim, .dtype and support numpy-style slicing.
How to chunk the array. Must be one of the following forms (illustrated in the sketch after this list):
A blocksize like 1000.
A blockshape like (1000, 1000).
Explicit sizes of all blocks along all dimensions like ((1000, 1000, 500), (400, 400)).
A size in bytes, like "100 MiB", which will choose a uniform block-like shape.
The word "auto", which acts like the above but uses the configuration value array.chunk-size for the chunk size.
-1 or None as a blocksize indicates the full size of the corresponding dimension.
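For illustration only, a short sketch of these forms; the input array and its shape (2000, 800) are assumptions made for the example:

>>> import numpy as np
>>> import dask.array as da
>>> x = np.ones((2000, 800))
>>> a = da.from_array(x, chunks=1000)                         # single blocksize for every axis
>>> a = da.from_array(x, chunks=(1000, 400))                  # blockshape, one size per axis
>>> a = da.from_array(x, chunks=((1000, 1000), (400, 400)))   # explicit sizes of all blocks
>>> a = da.from_array(x, chunks="100 MiB")                    # target chunk size in bytes
>>> a = da.from_array(x, chunks="auto")                       # use the array.chunk-size config value
>>> a = da.from_array(x, chunks=(-1, 400))                    # -1/None: full extent of that dimension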
The key name to use for the array. Defaults to a hash of x.
Hashing is useful if the same value of x is used to create multiple arrays, as Dask can then recognise that they're the same and avoid duplicate computations. However, it can also be slow, and if the array is not contiguous it is copied for hashing. If the array uses stride tricks (such as numpy.broadcast_to or skimage.util.view_as_windows) to have a larger logical than physical size, this copy can cause excessive memory usage.
If you don't need the deduplication provided by hashing, use name=False to generate a random name instead of hashing, which avoids the pitfalls described above. Using name=True is equivalent to the default.
By default, hashing uses Python's standard SHA-1. This behaviour can be changed by installing cityhash, xxhash or murmurhash; if installed, a large-factor speedup can be obtained in the tokenisation step.
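A minimal sketch of skipping the hash with name=False, assuming a stride-tricked input like the broadcast_to case mentioned above:

>>> import numpy as np
>>> import dask.array as da
>>> big = np.broadcast_to(np.ones(1000), (200000, 1000))      # large logical size, tiny physical size
>>> a = da.from_array(big, chunks=(10000, 1000), name=False)  # random name: no hashing, no copy of big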
Because this name is used as the key in task graphs, you should ensure that it uniquely identifies the data contained within. If you'd like to provide a descriptive name that is still unique, combine the descriptive name with :func:`dask.base.tokenize` of the array_like. See :ref:`graphs` for more.
If x doesn't support concurrent reads then provide a lock here, or pass in True to have dask.array create one for you.
If True then call np.asarray on chunks to convert them to numpy arrays. If False then chunks are passed through unchanged. If None (default) then we use True if the __array_function__ method is undefined.
If x doesn't support fancy indexing (e.g. indexing with lists or arrays) then set to False. Default is True.
The metadata for the resulting dask array. This is the kind of array that will result from slicing the input array. Defaults to the input array.
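As a loose illustration (assuming the optional sparse library is installed, which is not required by from_array itself), meta can tell Dask that the chunks resulting from slicing will not be plain numpy arrays:

>>> import numpy as np
>>> import dask.array as da
>>> import sparse  # doctest: +SKIP
>>> x = sparse.COO.from_numpy(np.eye(1000))  # doctest: +SKIP
>>> a = da.from_array(x, chunks=500, asarray=False,
...                   meta=sparse.COO.from_numpy(np.empty((0, 0))))  # doctest: +SKIP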
How to include the array in the task graph. By default (inline_array=False) the array is included in a task by itself, and each chunk refers to that task by its key.
.. code-block:: python

   >>> x = h5py.File("data.h5")["/x"]  # doctest: +SKIP
   >>> a = da.from_array(x, chunks=500)  # doctest: +SKIP
   >>> dict(a.dask)  # doctest: +SKIP
   {
    'array-original-<name>': <HDF5 dataset ...>,
    ('array-<name>', 0): (getitem, "array-original-<name>", ...),
    ('array-<name>', 1): (getitem, "array-original-<name>", ...)
   }
With inline_array=True, Dask will instead inline the array directly in the values of the task graph.
.. code-block:: python

   >>> a = da.from_array(x, chunks=500, inline_array=True)  # doctest: +SKIP
   >>> dict(a.dask)  # doctest: +SKIP
   {
    ('array-<name>', 0): (getitem, <HDF5 dataset ...>, ...),
    ('array-<name>', 1): (getitem, <HDF5 dataset ...>, ...)
   }
Note that there's no longer a key in the task graph for just the array x; instead it is placed directly in the values.
The right choice for inline_array depends on several factors, including the size of x, how expensive it is to create, which scheduler you're using, and the pattern of downstream computations. As a heuristic, inline_array=True may be the right choice when the array x is cheap to serialize and deserialize (since it's included in the graph many times) and if you're experiencing ordering issues (see dask.order for more).
This has no effect when x is a NumPy array.
Create dask array from something that looks like an array.
>>> x = h5py.File('...')['/data/path']  # doctest: +SKIP
>>> a = da.from_array(x, chunks=(1000, 1000))  # doctest: +SKIP
If your underlying datastore does not support concurrent reads then include the lock=True keyword argument, or lock=mylock if you want multiple arrays to coordinate around the same lock.
>>> a = da.from_array(x, chunks=(1000, 1000), lock=True) # doctest: +SKIP
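A hedged sketch of the shared-lock case; the second dataset y is hypothetical and stands in for any other array read from the same datastore:

>>> from threading import Lock
>>> mylock = Lock()
>>> a = da.from_array(x, chunks=(1000, 1000), lock=mylock)  # doctest: +SKIP
>>> b = da.from_array(y, chunks=(1000, 1000), lock=mylock)  # doctest: +SKIP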
If your underlying datastore has a .chunks attribute (as h5py and zarr datasets do) then a multiple of that chunk shape will be used if you do not provide a chunk shape.
>>> a = da.from_array(x, chunks='auto')  # doctest: +SKIP
>>> a = da.from_array(x, chunks='100 MiB')  # doctest: +SKIP
>>> a = da.from_array(x)  # doctest: +SKIP
If providing a name, ensure that it is unique.

>>> import dask.base
>>> token = dask.base.tokenize(x)  # doctest: +SKIP
>>> a = da.from_array(x, name='myarray-' + token)  # doctest: +SKIP
NumPy ndarrays are eagerly sliced and then embedded in the graph.
>>> import dask.array
>>> a = dask.array.from_array(np.array([[1, 2], [3, 4]]), chunks=(1, 1))
>>> a.dask[a.name, 0, 0][0]
array([1])
Chunks with exactly-specified, different sizes can be created.
>>> import numpy as np
>>> import dask.array as da
>>> x = np.random.random((100, 6))
>>> a = da.from_array(x, chunks=((67, 33), (6,)))
The following pages refer to this document explicitly or contain code examples that use it.
dask.array.core.Array.compute_chunk_sizes
dask.array.reductions.topk
dask.array.core.stack
dask.array.routines.histogram
dask.array.core.concatenate
dask.array.blockwise.blockwise
dask.array.core.Array
dask.array.core.from_zarr
dask.array.core.Array.map_overlap
dask.array.core.from_array
dask.array.overlap.map_overlap
dask.array.reductions.argtopk
dask.array.overlap.overlap