How Azure Durable Functions scale

Joonas Westlin | 04.06.2020
Reading time 5 min

If you haven’t yet read about Azure Durable Functions, they essentially allow you to build serverless workflows using C# or Javascript. So instead of a more graphical and declarative approach of e.g. Logic Apps, you use an imperative approach through code.

Since Durable Functions run on Azure Functions, running it on a Consumption Plan should allow it to scale essentially infinitely. But does it?

Short answer: Yes and no.

If you want the TL;DR, scroll down to the summary 🙂 On the other hand if you want the deep dive, let’s go!

Activity functions scale

Activities are essential steps in your workflow. All of the heavy liftings is done by activities. They are allowed to do almost anything a regular Azure Function can do: output bindings, API calls, database queries, etc. They are where you run the non-deterministic parts of your workflow; the parts that cannot run in an orchestrator function.

All activity functions in your Function app are triggered by messages from a single Storage Queue (the “work item queue”). Instances compete for messages from this queue, and whichever instance gets the message runs the activity.

This means that the number of instances that can run activities is essentially unlimited. You could have 1 instance consuming messages from the queue, or 1000 instances.

A single instance can also run activities in parallel on multiple threads. It is controlled with a setting in host.json:

{
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 10
    }
  }
}

The default for this setting is 10 x number of cores on the instance. So if the instance has 4 CPU cores, max 40 activity functions can run concurrently on the instance. But do remember that each running activity will use some of the shared resources on the instance. Personally, I would only increase this after doing performance testing.

The Azure Functions scale controller monitors the latency of peeking messages from the work item queue and increases/decreases the number of instances as needed. Since activities are stateless, this kind of scale is possible. This is not the case for orchestrators and entities though.

Orchestrator and entity functions scale

Orchestrator functions are what define your workflow, and can call activities and sub-orchestrators. They also require that their code is deterministic due to the fact that their code is replayed as the workflow progresses.

Entity functions on the other hand implement an actor-like model by allowing creation and modification of stateful, addressable entity instances. They can be used for various purposes, for example, aggregating data from various sources. They also have a guarantee that all requests to them are processed in series, removing the need to implement thread safety in the entity code.

Orchestrators and entities scale in a similar way. They are stateful singletons, and so cannot scale to an infinite amount of instances. It should never be possible for more than one Functions instance to run an orchestrator/entity instance and update its state. Since the history table is only appended to, it would be disastrous if duplicate events appeared there from multiple Functions instances running the same orchestrator instance.

Orchestrator and entity functions are triggered through messages in a control queue. There are N of these queues that your Durable Function uses. Their number depends on the number of partitions. The default is 4, but it can be changed through host.json:

{
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "partitionCount": 4
      }
    }
  }
}

This setting means there will be 4 queues. Only a single Functions instance can read messages from a control queue. The framework uses leases on blobs in a “task-hub-name-leases” blob container to control what instance can read a particular queue. This is what enforces that only a single Functions instance can read a single queue at a time. So there might be a single instance reading all 4 queues, or at most 4 instances reading a single queue each.

The control queue messages are used to start orchestrators, deliver activity/sub-orchestrator results, trigger scheduled execution, etc. The instance id of the orchestrator is used to decide which queue takes the message. Thus the good distribution of instance ids will help your throughput. By default, a GUID is generated without dashes and used as the id which should give that good distribution. If you specify the instance ids yourself, keep this in mind.

The scale controller in Azure Functions checks the latency for peeking queue messages to decide if more or less instances are needed, similarly to activities. But the upper bound is limited by that partition count; it won’t attempt to scale over that even if latencies in control queues remain high.

You can of course increase the partition count to increase maximum parallelism. The maximum is 16. But should you increase this limit? The documentation has this to say:

Generally speaking, orchestrator functions are intended to be lightweight and should not require large amounts of computing power. It is therefore not necessary to create a large number of control queue partitions to get great throughput for orchestrations. Most of the heavy work should be done in stateless activity functions, which can be scaled out infinitely.

For most scenarios, having 4 partitions will probably be enough. As they say, your orchestrators should not be doing heavy work. Also, a single Functions instance can execute multiple orchestrators/entities in parallel. This is controlled with this setting in host.json:

{
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 10
    }
  }
}

The default for this setting is the same as activities, 10 x number of cores on the instance. So if the instance has 4 CPU cores, max 40 orchestrator/entity functions can run concurrently on the instance. But again,
remember that each running orchestrator will use some of the shared resources on the instance.

Summary

Orchestrator and Entity functions’ scale is limited by the number of partitions configured in host.json. At most 16 Function instances can run them. For most scenarios, the default of 4 partitions/max 4 Functions instances is good enough.

Activity functions are not limited to scale. Azure Functions can scale them out to as many instances as needed.

Put computationally intensive code in activities to prevent blocking orchestrator execution. By keeping your orchestrators light, the default partition count will most likely be enough.

You can adjust the number of concurrently executing functions through settings in host.json as well. Keep in mind that increasing these limits puts more strain on shared resources on the instances.

Links