There is a popular misconception that introducing batching sacrifices latency to improve throughput.
I want to share a technique that enables batching without incurring any additional latency. I call it “non-blocking batching”, though the name hasn't really caught on.
I first read about it in an article by Lorenzo Alberton, but I haven't seen much use of this technique in the wild.
Batching is a common technique to improve the efficiency of processing a queue of tasks. It works by gathering multiple tasks into one batch and processing them together instead of one by one.
When processing tasks in a queue, there are two metrics that we care about: latency (how long a task waits before it is processed) and throughput (how many tasks are processed per unit of time).
Batching improves throughput by amortizing the overhead of expensive operations that can be done once per batch instead of once per task.
Examples of expensive operations:
Usually, batch processing is implemented like this:
1. Wait for a new task in the queue (blocking indefinitely)
2. Add the task to a new batch
3. Wait until a specified timeout has passed, adding newly arriving tasks to the batch
4. Process all tasks in the batch
5. Loop back to the beginning
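Here is a minimal sketch of this loop in Python, assuming tasks arrive on a standard `queue.Queue`; the names `MAXIMUM_BATCH_SIZE`, `BATCH_WAIT_TIME`, and `process_batch` are placeholders, not part of any particular library:

```python
import queue
import time

# Placeholder parameters and processing function, for illustration only.
MAXIMUM_BATCH_SIZE = 64
BATCH_WAIT_TIME = 0.1  # seconds

def process_batch(batch):
    """Do the expensive per-batch work here."""
    ...

def blocking_batch_loop(tasks: queue.Queue):
    while True:
        # Steps 1-2: wait indefinitely for a task and start a new batch with it.
        batch = [tasks.get()]
        # Step 3: keep adding tasks until the waiting time has passed
        # (or the batch is full).
        deadline = time.monotonic() + BATCH_WAIT_TIME
        while len(batch) < MAXIMUM_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(tasks.get(timeout=remaining))
            except queue.Empty:
                break
        # Step 4: process the whole batch at once.
        process_batch(batch)
        # Step 5: loop back to the beginning (the enclosing `while True`).
```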
For a random task distribution (governed by a Poisson random process), the batch processing looks as follows:
On this diagram, there are two timelines: the arrival of new tasks and the processing of batches.
We have two parameters for batching:

- Maximum batch size. Batch processing time is non-linear in the batch size, and past some threshold batching stops being effective. Limiting the batch size prevents this from happening.
- Batch waiting time. Increasing the batch waiting time allows more tasks to accumulate into a single batch. But this waiting is wasted time that increases the latency of every task by approximately half of the batch waiting time: with a 100 ms window, for example, a task arriving at a uniformly random moment within it waits an extra 50 ms on average. Usually, the batch waiting time is set to around half of the acceptable latency, or just by eyeballing it.
But, counterintuitively, it is best to set this time to zero. That is, we have the following batching algorithm:
1. Wait for a new task in the queue (blocking indefinitely)
2. Get up to `MaximumBatchSize` tasks from the queue without waiting
3. Process the tasks
4. Loop back to the beginning
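The same loop with the waiting time removed, again just a sketch using the placeholder names from above; the only change is that after the first blocking `get`, the tasks already in the queue are drained with `get_nowait`:

```python
def nonblocking_batch_loop(tasks: queue.Queue):
    while True:
        # Step 1: wait indefinitely for the first task.
        batch = [tasks.get()]
        # Step 2: grab whatever is already queued, up to MAXIMUM_BATCH_SIZE,
        # without waiting for more tasks to arrive.
        while len(batch) < MAXIMUM_BATCH_SIZE:
            try:
                batch.append(tasks.get_nowait())
            except queue.Empty:
                break
        # Step 3: process the whole batch at once.
        process_batch(batch)
        # Step 4: loop back to the beginning.
```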
Let's analyze the behavior of this batching in several scenarios.
Scenario 1. New tasks arrive slowly (the interval between arrivals is much longer than the processing time). In this case, every task is processed immediately with no additional latency. Additional waiting time would bring no benefit and would only increase the latency.
Scenario 2. Tasks arrive at regular intervals, but the average interval is shorter than the processing time of one task.
We get full utilization of processing resources with no additional latency. The batch size automatically adapts to the arrival rate of new tasks and to the processing time.
Scenario 3. New tasks arrive irregularly, in bursts. When the processor is idle, the first task of a burst is processed immediately and the rest are gathered into one batch.
This might be the only case where waiting makes sense, since it gives the entire burst a chance to be processed as a single batch.
I ran a simulation of the different cases and found that non-blocking batching beats conventional batching every time.
Here's how the two batching algorithms compare in different scenarios:
| Scenario | Average latency with non-blocking batching | Average latency with blocking batching |
|---|---|---|
| Slow rate of new tasks | 1.55s | 1.94s |
| Fast rate of new tasks | 5.13s | 7.1s |
| Bursty new tasks | 2s | 2.06s |
| New tasks arrive at the same rate as they can be processed | 5.14s | 7.12s |
We can see that in all scenarios it is better to do non-blocking batching compared to conventional blocking batching.
In all cases, the following parameters were used:
The code for the benchmark is available at https://github.com/dmitryvk/nowait-batch-modeling.
Non-blocking batching is the superior batching algorithm: it gives better latency, has fewer edge cases, and does not require complicated tuning.
It should be the default when developing systems that have a queue somewhere.
Although non-blocking batching improves latency and throughput, in many cases it also increases utilization (that is, the fraction of time when the processor is active). If new tasks arrive at a low rate and we want to minimize the utilization of the processor, then it is beneficial to introduce artificial delays.
This batching scheme easily generalizes to multiple workers: each worker simply does batching independently, as in the sketch below.
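For example, reusing the `nonblocking_batch_loop` sketch from above and a thread-safe `queue.Queue`, a multi-worker version could look like this (the function name and worker count are illustrative):

```python
import threading

def start_workers(tasks: queue.Queue, num_workers: int = 4):
    # Each worker runs the same non-blocking batching loop on the shared queue;
    # queue.Queue is thread-safe, so no extra coordination is needed.
    for _ in range(num_workers):
        threading.Thread(
            target=nonblocking_batch_loop, args=(tasks,), daemon=True
        ).start()
```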