Azure Event Grid — Handling Failures

What happens when a parcel cannot be delivered? The courier might retry delivery a few times. The customer might get notified after 3 failed attempts, but ultimately the parcel ends up in storage and it becomes the customer’s job to arrange a pick-up.

In my previous post I gave an overview of what Event Grid is, conceptually. If you missed it then you need only understand 2 concepts:

  1. The event grid topic provides an endpoint where the source sends events.
  2. A subscription details which events on a topic you’re interested in receiving, and where they should go.

When making something production-worthy, failure scenarios often influence design. When dealing with Event Grid, we should consider what happens when an event handler cannot process an event. If you want the details, Microsoft explains exactly how Event Grid handles events when delivery to a subscriber fails.

Retry policy & dead-lettering

Event Grid lets you configure the retry policy it applies when delivery fails. By default, when Event Grid receives certain responses from a subscriber (like 400 or 413), it skips retries altogether because those errors are deemed non-transient. Sensible in my opinion, but once the configured retries are exhausted, not much really happens. All you can do is store the failed events in a storage account, a process known as dead-lettering.
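To make the retry behaviour concrete, here is a minimal sketch of the decision made after each delivery attempt, as I understand it. This is not Event Grid's code: `RetryPolicy` and `next_action` are names I made up for illustration, the non-retryable set holds only the two status codes mentioned above (the real list is longer), and the values of 30 attempts and 1440 minutes are what I believe the documented defaults to be.

```python
from dataclasses import dataclass

# Only the two codes named in this post; Event Grid's real list is longer.
NON_RETRYABLE_STATUS_CODES = {400, 413}


@dataclass
class RetryPolicy:
    """Hypothetical stand-in for a subscription's retry settings."""
    max_delivery_attempts: int = 30   # assumed default
    event_ttl_minutes: int = 1440     # assumed default (24 hours)


def next_action(status_code: int, attempts_made: int, age_minutes: float,
                policy: RetryPolicy) -> str:
    """Decide what happens after one delivery attempt.

    Returns "done", "retry" or "give-up". Giving up means dead-lettering
    when a dead-letter destination is configured; otherwise the event is dropped.
    """
    if 200 <= status_code < 300:
        return "done"
    if status_code in NON_RETRYABLE_STATUS_CODES:
        return "give-up"  # non-transient: retrying would not help
    if attempts_made >= policy.max_delivery_attempts:
        return "give-up"  # retry budget exhausted
    if age_minutes >= policy.event_ttl_minutes:
        return "give-up"  # event has outlived its time-to-live
    return "retry"


policy = RetryPolicy(max_delivery_attempts=3)
print(next_action(503, 1, 0.5, policy))  # retry
print(next_action(400, 1, 0.5, policy))  # give-up
print(next_action(503, 3, 5.0, policy))  # give-up
```

The point of the sketch is that "give-up" is a dead end: nothing in the policy itself leads back to the subscriber.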

Resubmit failed events

There is no built-in functionality to replay failed events to the subscriber that failed to process them.

Note I said subscriber and not topic: if you resubmit failures to the topic, then all subscribers to that topic will receive an event that they might have already processed successfully. In theory, subscribers should be idempotent because Event Grid promises at-least-once delivery, but I still feel it is best not to give the subscribers unnecessary traffic.
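For what it is worth, idempotency on the subscriber side can be as simple as remembering which event ids have already been handled. A minimal sketch, assuming every event carries a unique `id` field (as the Event Grid event schema does) and using an in-memory set where a real handler would need durable storage; `IdempotentHandler` is a hypothetical name.

```python
from typing import Callable


class IdempotentHandler:
    """Wraps a processing function so that a repeated event id is ignored."""

    def __init__(self, process: Callable[[dict], None]) -> None:
        self._process = process
        self._seen_ids = set()  # in-memory only; lost on restart

    def handle(self, event: dict) -> bool:
        """Process the event once; return False if it was a duplicate."""
        event_id = event["id"]
        if event_id in self._seen_ids:
            return False
        self._process(event)
        # Mark as seen only after success, so a failed attempt can be retried.
        self._seen_ids.add(event_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
event = {"id": "abc-123", "eventType": "Parcel.Delivered", "data": {}}

print(handler.handle(event))  # True
print(handler.handle(event))  # False
print(len(processed))         # 1
```

Even with a guard like this in place, every duplicate still costs the subscriber a request, which is why I would rather not replay to the topic.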

If you want this replay option you have 4 options, in my eyes:

  1. Ignore the problem and accept it as a design choice of Event Grid
  2. Home-brew a solution by inventing some app that can read the failures and replay them to the subscriber. This would entail reading an event payload from a storage account and submitting it to the subscriber
  3. Buy a 3rd party solution like the one offered by Serverless360, which essentially wraps your event grid and gives some extra features
  4. Use a different messaging service like Service Bus, which provides this out-of-the-box
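To give a feel for what option 2 involves, here is a rough sketch of the replay step only. It assumes a dead-letter blob is a JSON array of events, and that dead-lettering adds the metadata fields listed in `DEAD_LETTER_FIELDS`; that list is my understanding of the dead-letter schema, not something to rely on. All function names are hypothetical, and downloading the blob from the storage account is left out.

```python
import json
import urllib.request
from typing import Callable

# Fields added to a dead-lettered event, as I understand the schema (assumption).
DEAD_LETTER_FIELDS = {
    "deadLetterReason", "deliveryAttempts", "lastDeliveryOutcome",
    "publishTime", "lastDeliveryAttemptTime",
}


def strip_dead_letter_metadata(event: dict) -> dict:
    """Return the event roughly as the subscriber would first have received it."""
    return {k: v for k, v in event.items() if k not in DEAD_LETTER_FIELDS}


def post_to_subscriber(url: str, events: list) -> int:
    """POST a batch of events straight to the subscriber's endpoint."""
    request = urllib.request.Request(
        url,
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def replay(blob_text: str, send: Callable[[list], int]) -> int:
    """Replay every event in one dead-letter blob; return how many were sent."""
    events = [strip_dead_letter_metadata(e) for e in json.loads(blob_text)]
    if events:
        status = send(events)
        if not 200 <= status < 300:
            raise RuntimeError(f"subscriber rejected replay with HTTP {status}")
    return len(events)


# Dry run with a fake sender instead of a real HTTP call.
sample_blob = json.dumps([{
    "id": "abc-123", "eventType": "Parcel.Failed", "data": {},
    "deadLetterReason": "MaxDeliveryAttemptsExceeded", "deliveryAttempts": 30,
}])
sent_batches = []


def fake_send(events: list) -> int:
    sent_batches.append(events)
    return 200


print(replay(sample_blob, fake_send))  # 1
```

Against a real subscriber, `send` would be something like `lambda events: post_to_subscriber(subscriber_url, events)`, where `subscriber_url` is whatever endpoint your handler exposes. Even this small sketch leaves out authentication, partial failures and tracking of what has already been replayed, which is a fair hint at how much work the home-brew route is.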

It might sound lazy, but I prefer either 1 or 4. The moment I have to solve something so integral to a design myself, I smell one of two things: either Event Grid is designed not to support this or it is missing key functionality. I highly doubt it’s the latter.

If you need good replay functionality then it’s probably because your events are highly valuable and cannot be lost. And if this is the case, you will want something like Service Bus. Check this from the docs:

When handling high-value messages that cannot be lost or duplicated, use Azure Service Bus.

Closing thought

Event Grid is, at heart, an event-routing service. What this means is that it connects a publisher to a subscriber, via an event, at one point in time. The way I see it, it was not designed to support bullet-proof recovery from all scenarios. It has not always supported dead-lettering, which is at present the only decent way of recording failures.

I am not a fan of shoe-horning, especially when choosing a third-party framework like Event Grid, so I would take Event Grid’s handling of failure as intentional design. If I had a problem with it, I would re-think my design.

I like the fact that you can configure the behaviour of the retry policy, although the default should make sense in most scenarios.

Dead-lettering feels like it should form more of an audit trail than a recovery plan. Setting up alerts for subscribers on failure seems far more helpful, though.