dask 2021.10.0

BackRef

To remove in the future –– dask.order

Static order of nodes in dask graph

Dask makes decisions on what tasks to prioritize both

Dynamically we prefer to run tasks that were just made available. However when several tasks become available at the same time we have an opportunity to break ties in an intelligent way

d |

b c

\ /

a

For example after we finish a we can choose to run either b or c next. Making small decisions like this can greatly affect our performance, especially because the order in which we run tasks affects the order in which we can release memory, which operationally we find to have a large affect on many computation. We want to run tasks in such a way that we keep only a small amount of data in memory at any given time.

Static Ordering

And so we create a total ordering over all nodes to serve as a tie breaker. We represent this ordering with a dictionary mapping keys to integer values. Lower scores have higher priority. These scores correspond to the order in which a sequential scheduler would visit each node.

{'a': 0,

'c': 1, 'd': 2, 'b': 3}

There are several ways in which we might order our keys. This is a nuanced process that has to take into account many different kinds of workflows, and operate efficiently in linear time. We strongly recommend that readers look at the docstrings of tests in dask/tests/test_order.py. These tests usually have graph types laid out very carefully to show the kinds of situations that often arise, and the order we would like to be determined.

Policy

Work towards small goals with big steps.

  1. Small goals: prefer tasks that have few total dependents and whose final dependents have few total dependencies.

    We prefer to prioritize those tasks that help branches of computation that can terminate quickly.

    With more detail, we compute the total number of dependencies that each task depends on (both its own dependencies, and the dependencies of its dependencies, and so on), and then we choose those tasks that drive towards results with a low number of total dependencies. We choose to prioritize tasks that work towards finishing shorter computations first.

  2. Big steps: prefer tasks with many dependents

    However, many tasks work towards the same final dependents. Among those, we choose those tasks with the most work left to do. We want to finish the larger portions of a sub-computation before we start on the smaller ones.

  3. Name comparison: break ties with key name

    Often graphs are made with regular keynames. When no other structural difference exists between two keys, use the key name to break ties. This relies on the regularity of graph constructors like dask.array to be a good proxy for ordering. This is usually a good idea and a sane default.

Examples

See :

Back References

The following pages refer to to this document either explicitly or contain code examples using this.

dask.array.routines.ravel dask.array.routines.array

Local connectivity graph

Hover to see nodes names; edges to Self not shown, Caped at 50 nodes.

Using a canvas is more power efficient and can get hundred of nodes ; but does not allow hyperlinks; , arrows or text (beyond on hover)

SVG is more flexible but power hungry; and does not scale well to 50 + nodes.

All aboves nodes referred to, (or are referred from) current nodes; Edges from Self to other have been omitted (or all nodes would be connected to the central node "self" which is not useful). Nodes are colored by the library they belong to, and scaled with the number of references pointing them


File: /dask/order.py#0
type: <class 'module'>
Commit: