Workflow implementation#

Workflows in ewoks are based on networkx graphs, both in terms of runtime representation and persistent representation. At runtime, links between nodes are hash links which provide a unique identifier for each task output. This identifier is used to save and load task outputs from external storage (e.g. HDF5).

Hash implementation#

Universal hashing in ewoks is currently based on SHA-256. The UniversalHash class represents a universal hash at runtime. Several built-in Python types are universally hashable: strings, numbers, mappings, sets and iterables. Custom types that need to be universally hashable should derive from UniversalHashable.
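To illustrate how built-in types can be hashed deterministically with SHA-256, here is a minimal sketch. It is not the ewoks implementation: the function name `sketch_uhash`, the type tags and the encoding are assumptions made for illustration only.

```python
import hashlib
from collections.abc import Iterable, Mapping, Set


def sketch_uhash(value) -> str:
    """Hypothetical helper: return a SHA-256 hex digest for built-in values."""
    h = hashlib.sha256()
    if isinstance(value, str):
        h.update(b"str:" + value.encode("utf-8"))
    elif isinstance(value, bytes):
        h.update(b"bytes:" + value)
    elif value is None or isinstance(value, (bool, int, float)):
        # Tag with the type name so that "1" (str) and 1 (int) hash differently
        h.update(type(value).__name__.encode() + b":" + repr(value).encode())
    elif isinstance(value, Mapping):
        h.update(b"mapping:")
        # Sort by digest so that key insertion order does not matter
        pairs = sorted((sketch_uhash(k), sketch_uhash(v)) for k, v in value.items())
        for key_digest, value_digest in pairs:
            h.update(key_digest.encode() + value_digest.encode())
    elif isinstance(value, Set):
        h.update(b"set:")
        # Sets are unordered: sort the element digests
        for digest in sorted(sketch_uhash(v) for v in value):
            h.update(digest.encode())
    elif isinstance(value, Iterable):
        h.update(b"iterable:")
        # Order matters for lists, tuples and other iterables
        for item in value:
            h.update(sketch_uhash(item).encode())
    else:
        raise TypeError(f"{type(value).__name__} is not hashable in this sketch")
    return h.hexdigest()
```

In this sketch, two mappings with the same content but a different key order give the same digest, while two lists with the same items in a different order do not.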

Tasks and task inputs and outputs are universally hashable and are implemented as described in this class diagram:

classDiagram
    UniversalHashable <|-- Variable
    Variable <|-- VariableContainer
    Variable --o VariableContainer
    UniversalHashable <|-- Task
    Task o-- VariableContainer
    class UniversalHashable{
        -version
        -class_nonce
        #pre_uhash
        #class_uhash
        #instance_nonce
        #data_uhash()
        uhash() UniversalHash
    }
    class Variable{
        value
        data_proxy: DataProxy
    }
    class VariableContainer{
        value: Dictionary<string|int, Variable>
    }
    class Task{
        input_variables: VariableContainer
        output_variables: VariableContainer
    }

UniversalHashable#

The return value of UniversalHashable.uhash() can be either

  • the universal hash of pre_uhash and instance_nonce when instance_nonce is provided on instantiation

  • equal to pre_uhash when instance_nonce is NOT provided on instantiation

The value of UniversalHashable.pre_uhash can be either

  • provided on instantiation

  • the universal hash of UniversalHashable.class_nonce and the return value of UniversalHashable.data_uhash()

The value of UniversalHashable.class_nonce is the universal hash of

  • the fully qualified class name

  • UniversalHashable.version

  • UniversalHashable.class_nonce of the base class
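The three rules above can be sketched as a small class. This is a simplified illustration, not the ewoks source: the names `SketchHashable`, `Number`, `VERSION` and the helper `_sha256` are hypothetical, and values are combined through their `repr` purely for brevity.

```python
import hashlib


def _sha256(*parts) -> str:
    """Hypothetical helper: hash the repr of each part with SHA-256."""
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode("utf-8"))
        h.update(b"\x00")  # separator between parts
    return h.hexdigest()


class SketchHashable:
    VERSION = None  # bump to invalidate the hashes of all instances

    def __init__(self, pre_uhash=None, instance_nonce=None):
        self._pre_uhash = pre_uhash
        self._instance_nonce = instance_nonce

    @classmethod
    def class_nonce(cls) -> str:
        # Hash of: fully qualified class name, version, class nonce of the base class
        base = cls.__mro__[1]
        base_nonce = base.class_nonce() if hasattr(base, "class_nonce") else None
        qualified_name = f"{cls.__module__}.{cls.__qualname__}"
        return _sha256(qualified_name, cls.VERSION, base_nonce)

    def data_uhash(self):
        return None  # subclasses return the data that identifies the instance

    @property
    def pre_uhash(self) -> str:
        # Either provided on instantiation or derived from class nonce and data
        if self._pre_uhash is not None:
            return self._pre_uhash
        return _sha256(self.class_nonce(), self.data_uhash())

    def uhash(self) -> str:
        # Without an instance nonce, the hash equals pre_uhash
        if self._instance_nonce is None:
            return self.pre_uhash
        return _sha256(self.pre_uhash, self._instance_nonce)


class Number(SketchHashable):
    VERSION = 1

    def __init__(self, value, **kwargs):
        super().__init__(**kwargs)
        self.value = value

    def data_uhash(self):
        return self.value
```

With these definitions, `Number(1).uhash()` is stable across instances, changes when the value changes, and changes again when an `instance_nonce` is passed.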

Variable#

The return value of Variable.data_uhash() is Variable.value, or None when hashing is disabled.

The Variable.data_proxy provides read-write access to the Variable data in external storage.

A DataProxy generates a DataUri for a root URI and a UniversalHashable (in this case a Variable).

For example, when the root URI is “/tmp/dataset_name.nx?path=scan_name/task_results/var1”, the DataUri looks like one of the following, depending on the scheme:

  • .json:///tmp/dataset_name/scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01.json

  • .nexus:///tmp/dataset_name.nx?path=scan_name/task_results/var1/6872c154c80bfcda0a9a769e3c1b4c85b8a56ad8d022d5c5da3ef9c036bc1e01
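The mapping from root URI and universal hash to a storage location can be sketched as follows. This is reconstructed from the single example above and is not the ewoks DataProxy API: the function name `sketch_data_locations` is hypothetical, the `?path=` parsing is an assumption, and only the part after the scheme prefix is produced.

```python
import posixpath


def sketch_data_locations(root_uri: str, uhash: str) -> dict:
    """Hypothetical helper: derive JSON and NeXus locations for one hash."""
    # Split "/file.nx?path=a/b" into the file name and the path inside the file
    file_name, _, internal_path = root_uri.partition("?path=")

    # JSON: one file per hash, in a directory tree named after the
    # root file (without extension) and the internal path
    base, _extension = posixpath.splitext(file_name)
    json_location = posixpath.join(base, internal_path, uhash + ".json")

    # NeXus: a single file, with the hash appended to the internal path
    nexus_location = f"{file_name}?path={posixpath.join(internal_path, uhash)}"

    return {"json": json_location, "nexus": nexus_location}
```

Calling this with the root URI and hash from the example above gives the same locations as the two URIs listed, without their scheme prefixes.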

Example:#

A task that takes a single integer as input and produces an array as output:

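from ewokscore import Task
from numpy.random import random  # assumption: the source of `random` is not stated


class MyTask(Task, input_names=["N"], output_names=["array"]):
    def run(self):
        # Generate N random numbers and store them in the "array" output
        self.outputs.array = random(self.inputs.N)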

When instantiating MyTask, the following happens

self.input_variables = VariableContainer(value={"N": N})
self.output_variables = VariableContainer(value={"array": self.MISSING_DATA},
                                          pre_uhash=self.input_variables,
                                          instance_nonce=self.class_nonce())
self.pre_uhash = self.output_variables

The universal hash of the task is equal to the universal hash of the output container.

The input variable container instantiates this variable

input_variables["N"] = Variable(value=100000)
# N is a `Variable` (task input in this case)
# Its value is 100000
# Its uhash is calculated from the value

The output variable container instantiates this variable

output_variables["array"] = Variable(value=output_variables.MISSING_DATA,
                                     pre_uhash=output_variables.pre_uhash,
                                     instance_nonce=(output_variables.instance_nonce, "array"))
# array is a `Variable` (task output in this case)
# Its value is not yet defined (set in the `run` method)
# Its uhash is not calculated from the value but from the uhash of the task input container

This scheme ensures that the hash of a single output variable depends on all upstream inputs and does not depend on its own value. The output variables take MyTask.class_nonce() as an instance nonce to ensure that different tasks with identical upstream inputs produce outputs with different hashes.
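The properties of this scheme can be checked with a small stand-alone sketch. It does not use ewoks: `sketch_output_uhash` and `_sha256` are hypothetical helpers, and the class nonce is reduced to a hash of the task class name.

```python
import hashlib


def _sha256(*parts) -> str:
    """Hypothetical helper: hash the repr of each part with SHA-256."""
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode("utf-8"))
        h.update(b"\x00")  # separator between parts
    return h.hexdigest()


def sketch_output_uhash(task_class_name: str, inputs: dict, output_name: str) -> str:
    # Hash of the input container: depends only on the input values
    inputs_uhash = _sha256(sorted(inputs.items()))
    # Simplified class nonce: depends only on the task class
    class_nonce = _sha256(task_class_name)
    # Output variable: pre_uhash is the input hash, and the instance nonce
    # is the pair (class nonce, output name); the output value is never used
    return _sha256(inputs_uhash, (class_nonce, output_name))


same_inputs = {"N": 100000}
hash_a = sketch_output_uhash("MyTask", same_inputs, "array")
hash_b = sketch_output_uhash("OtherTask", same_inputs, "array")
hash_c = sketch_output_uhash("MyTask", {"N": 5}, "array")
```

Here `hash_a`, `hash_b` and `hash_c` all differ: changing either the task class or an upstream input changes the output hash, while the output value plays no role because it is not an argument at all.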