This object encodes a Dask task graph that is composed of layers of dependent subgraphs, such as commonly occurs when building task graphs using high level collections like Dask array, bag, or dataframe.
Typically each high level array, bag, or dataframe operation takes the task graphs of the input collections, merges them, and then adds one or more new layers of tasks for the new operation. These layers typically have at least as many tasks as there are partitions or chunks in the collection. The HighLevelGraph object stores the subgraphs for each operation separately in sub-graphs, and also stores the dependency structure between them.
The subgraph layers, keyed by a unique name
The set of layers on which each layer depends
Mapping (some) keys in the high level graph to their dependencies. If a key is missing, its dependencies will be calculated on-the-fly.
Task graph composed of layers of dependent subgraphs
HighLevelGraph.from_collections
typically used by developers to make new HighLevelGraphs
Here is an idealized example that shows the internal state of a HighLevelGraph
This example is valid syntax, but we were not able to check execution>>> import dask.dataframe as ddThis example is valid syntax, but we were not able to check execution
>>> df = dd.read_csv('myfile.*.csv') # doctest: +SKIPThis example is valid syntax, but we were not able to check execution
... df = df + 100 # doctest: +SKIP
... df = df[df.name == 'Alice'] # doctest: +SKIP
>>> graph = df.__dask_graph__() # doctest: +SKIPThis example is valid syntax, but we were not able to check execution
... graph.layers # doctest: +SKIP { 'read-csv': {('read-csv', 0): (pandas.read_csv, 'myfile.0.csv'), ('read-csv', 1): (pandas.read_csv, 'myfile.1.csv'), ('read-csv', 2): (pandas.read_csv, 'myfile.2.csv'), ('read-csv', 3): (pandas.read_csv, 'myfile.3.csv')}, 'add': {('add', 0): (operator.add, ('read-csv', 0), 100), ('add', 1): (operator.add, ('read-csv', 1), 100), ('add', 2): (operator.add, ('read-csv', 2), 100), ('add', 3): (operator.add, ('read-csv', 3), 100)} 'filter': {('filter', 0): (lambda part: part[part.name == 'Alice'], ('add', 0)), ('filter', 1): (lambda part: part[part.name == 'Alice'], ('add', 1)), ('filter', 2): (lambda part: part[part.name == 'Alice'], ('add', 2)), ('filter', 3): (lambda part: part[part.name == 'Alice'], ('add', 3))} }
>>> graph.dependencies # doctest: +SKIP { 'read-csv': set(), 'add': {'read-csv'}, 'filter': {'add'} }See :
Hover to see nodes names; edges to Self not shown, Caped at 50 nodes.
Using a canvas is more power efficient and can get hundred of nodes ; but does not allow hyperlinks; , arrows or text (beyond on hover)
SVG is more flexible but power hungry; and does not scale well to 50 + nodes.
All aboves nodes referred to, (or are referred from) current nodes; Edges from Self to other have been omitted (or all nodes would be connected to the central node "self" which is not useful). Nodes are colored by the library they belong to, and scaled with the number of references pointing them