histogramdd(sample, bins, range=None, normed=None, weights=None, density=None)
Chunking of the input data (sample) is only allowed along the 0th (row) axis (the axis corresponding to the total number of samples). Data chunked along the 1st (column) axis is not compatible with this function. If weights are used, they must be chunked along the 0th axis identically to the input sample.
An example setup for a three-dimensional histogram, where the sample shape is (8, 3) and the weights shape is (8,): the sample chunks would be ((4, 4), (3,)) and the weights chunks would be ((4, 4),), giving a table of the structure:
+-------+-----------------------+-----------+
|       |      sample (8 x 3)   |  weights  |
+=======+=====+=====+=====+=====+=====+=====+
| chunk | row |  x  |  y  |  z  | row |  w  |
+-------+-----+-----+-----+-----+-----+-----+
|       |   0 |   5 |   6 |   6 |   0 | 0.5 |
|       +-----+-----+-----+-----+-----+-----+
|       |   1 |   8 |   9 |   2 |   1 | 0.8 |
|   0   +-----+-----+-----+-----+-----+-----+
|       |   2 |   3 |   3 |   1 |   2 | 0.3 |
|       +-----+-----+-----+-----+-----+-----+
|       |   3 |   2 |   5 |   6 |   3 | 0.7 |
+-------+-----+-----+-----+-----+-----+-----+
|       |   4 |   3 |   1 |   1 |   4 | 0.3 |
|       +-----+-----+-----+-----+-----+-----+
|       |   5 |   3 |   2 |   9 |   5 | 1.3 |
|   1   +-----+-----+-----+-----+-----+-----+
|       |   6 |   8 |   1 |   5 |   6 | 0.8 |
|       +-----+-----+-----+-----+-----+-----+
|       |   7 |   3 |   5 |   3 |   7 | 0.7 |
+-------+-----+-----+-----+-----+-----+-----+
If the sample 0th dimension and weight 0th (row) dimension are chunked differently, a ValueError will be raised. If coordinate groupings ((x, y, z) trios) are separated by a chunk boundary, then a ValueError will be raised. We suggest that you rechunk your data if it is of that form.
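As a minimal sketch of the setup described above, the snippet below builds the (8, 3) sample and (8,) weights from the table, chunks both as ((4, 4), ...) along the rows, and then shows one way to rechunk column-chunked data. The bin counts and the (0, 10) ranges are illustrative choices, not part of the original example.

```python
import numpy as np
import dask.array as da

# Values copied from the table: eight (x, y, z) rows and eight weights.
sample_np = np.array(
    [
        [5, 6, 6],
        [8, 9, 2],
        [3, 3, 1],
        [2, 5, 6],
        [3, 1, 1],
        [3, 2, 9],
        [8, 1, 5],
        [3, 5, 3],
    ],
    dtype=float,
)
weights_np = np.array([0.5, 0.8, 0.3, 0.7, 0.3, 1.3, 0.8, 0.7])

# Chunk along rows only; weights share the same row chunking.
sample = da.from_array(sample_np, chunks=(4, 3))
weights = da.from_array(weights_np, chunks=4)

# Illustrative binning (an assumption): 2 bins per dimension over [0, 10].
bins = (2, 2, 2)
ranges = ((0, 10),) * 3
h, edges = da.histogramdd(sample, bins=bins, range=ranges, weights=weights)
total = h.sum().compute()

# Column-chunked data splits (x, y, z) trios across chunks.
bad = da.from_array(sample_np, chunks=(4, 1))
try:
    da.histogramdd(bad, bins=bins, range=ranges)
    raised = False
except ValueError:
    # The behavior documented above for incompatible chunking.
    raised = True

# Rechunking so every row lives in a single chunk makes the data usable.
fixed = bad.rechunk((4, 3))
h_fixed, _ = da.histogramdd(fixed, bins=bins, range=ranges, weights=weights)
```

Since every sample falls inside the chosen ranges, the weighted histogram should sum to the sum of the weights.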
The chunks property of the data (and optional weights) is used to check for compatibility with the blocked algorithm (as described above); therefore, you must call to_dask_array on a collection from dask.dataframe, i.e. dask.dataframe.Series or dask.dataframe.DataFrame, before passing it to this function.
The function is also compatible with x, y, and z being individual 1D arrays with equal chunking. In that case, the data should be passed as a tuple: histogramdd((x, y, z), ...).
Multidimensional data to be histogrammed.
Note the unusual interpretation of a sample when it is a sequence of dask Arrays:
When a (N, D) dask Array, each row is an entry in the sample (coordinate in D dimensional space).
When a sequence of dask Arrays, each element in the sequence is the array of values for a single coordinate.
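The two interpretations can be compared directly. The sketch below, which reuses the small integer arrays from the examples on this page, passes the same data once as a sequence of 1D arrays and once as a single (N, D) array; the chunk size of 3 is an arbitrary choice for illustration.

```python
import numpy as np
import dask.array as da

# Three coordinate arrays with identical chunking (two chunks of 3).
x = da.from_array(np.array([2, 4, 2, 4, 2, 4]), chunks=3)
y = da.from_array(np.array([2, 2, 4, 4, 2, 4]), chunks=3)
z = da.from_array(np.array([4, 2, 4, 2, 4, 2]), chunks=3)
edges = ([0, 3, 6],) * 3

# Sequence form: each element holds the values of one coordinate.
h_seq, _ = da.histogramdd((x, y, z), bins=edges)

# (N, D) form: each row is one sample. da.stack places every column in
# its own chunk, so rechunk to keep each row inside a single chunk.
rows = da.stack([x, y, z], axis=1).rechunk((3, 3))
h_rows, _ = da.histogramdd(rows, bins=edges)

same = np.array_equal(h_seq.compute(), h_rows.compute())
```

Both calls describe the same six points, so the two histograms are expected to match.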
The bin specification.
The possible binning configurations are:
A sequence of arrays describing the monotonically increasing bin edges along each dimension.
A single int describing the total number of bins that will be used in each dimension (this requires the range argument to be defined).
A sequence of ints describing the total number of bins to be used in each dimension (this requires the range argument to be defined).
When bins are described by arrays, the rightmost edge is included. Bins described by arrays also allow for non-uniform bin widths.
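A small sketch of both properties, using made-up two-dimensional data: the second dimension has bins of unequal width, and one sample sits exactly on the rightmost edge in both dimensions.

```python
import numpy as np
import dask.array as da

# Three 2D samples; the second lies exactly on the last edge of each axis.
data = np.array(
    [
        [0.0, 0.0],
        [1.0, 10.0],
        [0.5, 1.0],
    ]
)
sample = da.from_array(data, chunks=(2, 2))

# Explicit edges: uniform widths along x, non-uniform widths along y.
edges = [
    np.array([0.0, 0.5, 1.0]),
    np.array([0.0, 1.0, 10.0]),
]

h, _ = da.histogramdd(sample, bins=edges)
counts = h.compute()
```

The sample at (1.0, 10.0) is counted in the last bin rather than dropped, so all three samples appear in the result.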
A sequence of length D, each a (min, max) tuple giving the outer bin edges to be used if the edges are not given explicitly in bins. If defined, this argument is required to have an entry for each dimension. Unlike numpy.histogramdd, if bins does not define bin edges, this argument is required (this function will not automatically use the min and max of the values in a given dimension because the input data may be lazy in dask).
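If the data-driven range that numpy.histogramdd would pick is what you want, one option is to compute the per-dimension minimum and maximum explicitly and pass them in. This is a sketch under the assumption that an extra pass over the data is acceptable; the random data and bin counts are illustrative.

```python
import numpy as np
import dask
import dask.array as da

rng = np.random.default_rng(0)
x = da.from_array(rng.normal(size=(1000, 3)), chunks=(250, 3))

# Compute both reductions together, then build (min, max) pairs per dimension.
lo, hi = dask.compute(x.min(axis=0), x.max(axis=0))
ranges = tuple((float(a), float(b)) for a, b in zip(lo, hi))

# Integer bin counts require the explicit range.
h, edges = da.histogramdd(x, bins=(4, 5, 3), range=ranges)
counts = h.compute()
```

Because the ranges span the observed minimum and maximum, every sample lands in some bin.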
An alias for the density argument that behaves identically. To avoid confusion with the broken argument to histogram, density should be preferred.
An array of values weighing each sample in the input data. The chunks of the weights must be identical to the chunking along the 0th (row) axis of the data sample.
If False (default), the returned array represents the number of samples in each bin. If True, the returned array represents the probability density function at each bin.
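To illustrate the difference, the sketch below computes both forms for the same made-up data with non-uniform bins. It assumes the usual density normalization (counts divided by the total count and by each bin's volume), so the density multiplied by the bin volumes should sum to 1.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
data = np.column_stack([rng.uniform(0, 3, 500), rng.uniform(0, 4, 500)])
sample = da.from_array(data, chunks=(100, 2))

# Non-uniform edges along x, uniform along y.
edges = [np.array([0.0, 1.0, 3.0]), np.array([0.0, 2.0, 4.0])]

counts, _ = da.histogramdd(sample, bins=edges)
pdf, _ = da.histogramdd(sample, bins=edges, density=True)

# Volume (here: area) of every bin, from the edges passed in.
volumes = np.outer(np.diff(edges[0]), np.diff(edges[1]))
integral = (pdf.compute() * volumes).sum()
```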
The values of the histogram.
Sequence of arrays representing the bin edges along each dimension.
Blocked variant of numpy.histogramdd.
Computing the histogram in 5 blocks using different bin edges along each dimension:
>>> import numpy as np
>>> import dask.array as da
>>> x = da.random.uniform(0, 1, size=(1000, 3), chunks=(200, 3))
>>> edges = [
...     np.linspace(0, 1, 5),  # 4 bins in 1st dim
...     np.linspace(0, 1, 6),  # 5 in the 2nd
...     np.linspace(0, 1, 4),  # 3 in the 3rd
... ]
>>> h, edges = da.histogramdd(x, bins=edges)
>>> result = h.compute()
>>> result.shape
(4, 5, 3)
Defining the bins by total number and their ranges, along with using weights:
>>> bins = (4, 5, 3)
>>> ranges = ((0, 1),) * 3  # expands to ((0, 1), (0, 1), (0, 1))
>>> w = da.random.uniform(0, 1, size=(1000,), chunks=x.chunksize[0])
>>> h, edges = da.histogramdd(x, bins=bins, range=ranges, weights=w)
>>> np.isclose(h.sum().compute(), w.sum().compute())
True
Using a sequence of 1D arrays as the input:
>>> x = da.array([2, 4, 2, 4, 2, 4])
>>> y = da.array([2, 2, 4, 4, 2, 4])
>>> z = da.array([4, 2, 4, 2, 4, 2])
>>> bins = ([0, 3, 6],) * 3
>>> h, edges = da.histogramdd((x, y, z), bins)
>>> h
dask.array<sum-aggregate, shape=(2, 2, 2), dtype=float64, chunksize=(2, 2, 2), chunktype=numpy.ndarray>
>>> edges[0]
dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> h.compute()
array([[[0., 2.],
        [0., 1.]],
<BLANKLINE>
       [[1., 0.],
        [2., 0.]]])
>>> edges[0].compute()
array([0, 3, 6])
>>> edges[1].compute()
array([0, 3, 6])
>>> edges[2].compute()
array([0, 3, 6])
The following pages refer to this document either explicitly or contain code examples using this.
dask.array.routines.histogram2d
dask.array.routines.histogramdd