Document

dask 2021.10.0

This object encodes a Dask task graph that is composed of layers of dependent subgraphs, as commonly arises when building task graphs with high-level collections such as Dask array, bag, or dataframe.

Typically each high-level array, bag, or dataframe operation takes the task graphs of the input collections, merges them, and then adds one or more new layers of tasks for the new operation. These layers typically have at least as many tasks as there are partitions or chunks in the collection. The HighLevelGraph object keeps the subgraph for each operation as a separate layer and also stores the dependency structure between those layers.

Parameters

layers : Mapping[str, Mapping]: The subgraph layers, keyed by a unique name
dependencies : Mapping[str, set[str]]: The set of layers on which each layer depends
key_dependencies : Mapping[Hashable, set], optional: Maps (some) keys in the high level graph to their dependencies. If a key is missing, its dependencies will be calculated on the fly.
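
The shapes of `layers` and `dependencies` can be illustrated with plain dictionaries. The following is a minimal sketch that deliberately avoids importing dask: the layer names `source` and `add`, and the helpers `flatten` and `evaluate`, are hypothetical and only serve to show how per-layer subgraphs combine into one low-level task graph. It is not how dask itself schedules or executes tasks.

```python
from operator import add

# Plain-dict stand-ins for the two required parameters (assumed example data).
layers = {
    "source": {("source", 0): 1, ("source", 1): 2},
    "add": {
        ("add", 0): (add, ("source", 0), 100),
        ("add", 1): (add, ("source", 1), 100),
    },
}
# Each layer name maps to the set of layer names it depends on.
dependencies = {"source": set(), "add": {"source"}}


def flatten(layers):
    """Merge all layers into one low-level graph mapping key -> task."""
    flat = {}
    for layer in layers.values():
        flat.update(layer)
    return flat


def evaluate(graph, key):
    """Toy evaluator (hypothetical): resolve one key by recursing into its task."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that are themselves keys of the graph are computed first.
        resolved = [
            evaluate(graph, a) if isinstance(a, tuple) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task  # a literal value, not a task


flat = flatten(layers)
results = [evaluate(flat, ("add", i)) for i in range(2)]
print(results)
```

Keeping the layers separate, rather than storing only the flattened dictionary, preserves the information about which operation produced which tasks.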

Task graph composed of layers of dependent subgraphs

Examples

Here is an idealized example that shows the internal state of a HighLevelGraph:

This example is valid syntax, but we were not able to check execution

>>> import dask.dataframe as dd

>>> df = dd.read_csv('myfile.*.csv')  # doctest: +SKIP
>>> df = df + 100  # doctest: +SKIP
>>> df = df[df.name == 'Alice']  # doctest: +SKIP

>>> graph = df.__dask_graph__()  # doctest: +SKIP
>>> graph.layers  # doctest: +SKIP
{
 'read-csv': {('read-csv', 0): (pandas.read_csv, 'myfile.0.csv'),
              ('read-csv', 1): (pandas.read_csv, 'myfile.1.csv'),
              ('read-csv', 2): (pandas.read_csv, 'myfile.2.csv'),
              ('read-csv', 3): (pandas.read_csv, 'myfile.3.csv')},
 'add': {('add', 0): (operator.add, ('read-csv', 0), 100),
         ('add', 1): (operator.add, ('read-csv', 1), 100),
         ('add', 2): (operator.add, ('read-csv', 2), 100),
         ('add', 3): (operator.add, ('read-csv', 3), 100)},
 'filter': {('filter', 0): (lambda part: part[part.name == 'Alice'], ('add', 0)),
            ('filter', 1): (lambda part: part[part.name == 'Alice'], ('add', 1)),
            ('filter', 2): (lambda part: part[part.name == 'Alice'], ('add', 2)),
            ('filter', 3): (lambda part: part[part.name == 'Alice'], ('add', 3))}
}

>>> graph.dependencies  # doctest: +SKIP
{
 'read-csv': set(),
 'add': {'read-csv'},
 'filter': {'add'}
}
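
Because `dependencies` maps each layer name to the layers it depends on, it has the same shape that a topological sort expects. As a small sketch using only the standard library (`graphlib`, Python 3.9+), and assuming the idealized dependency mapping from the example, one valid order in which the layers could be processed is:

```python
from graphlib import TopologicalSorter

# Layer-level dependencies copied from the idealized example.
dependencies = {
    "read-csv": set(),
    "add": {"read-csv"},
    "filter": {"add"},
}

# TopologicalSorter takes a mapping of node -> predecessors,
# so every layer is emitted after the layers it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['read-csv', 'add', 'filter']
```

This only illustrates the structure of the mapping; it makes no claim about how dask orders or optimizes layers internally.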

File: /dask/highlevelgraph.py#559
type: <class 'abc.ABCMeta'>
Commit:
