How an AWS RDS Multi-AZ Failover Caused Read-Only Transaction Errors, and the Retry/Backoff Pattern That Prevented Checkout Failures

Everything was working fine until, boom, the online store's checkout started failing! But it wasn't a bug in the app. Nope. The issue came from deep within the infrastructure: AWS RDS triggered a Multi-AZ failover and left database connections in a weird read-only state for a few seconds. That's where our heroic retry/backoff pattern saved the day. Let's dive into what happened and how one smart coding choice made all the difference!

TL;DR

AWS RDS Multi-AZ failovers can leave database connections temporarily read-only. If your app isn't ready for that, customers can't finish important actions like checkout. Thankfully, a retry-with-backoff mechanism let the app recover smoothly. No lost orders, no angry shoppers!

What Is AWS RDS Multi-AZ?

AWS RDS Multi-AZ (Multi-Availability Zone) is like having a spare tire for your database. RDS keeps a synchronized standby instance in another Availability Zone, and when your primary DB stumbles (hardware failure, software bugs, or maintenance), AWS automatically flips over to it.

It’s automatic, and it’s *awesome* for resilience. But there’s a catch…
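
You can even rehearse this. In a test environment, RDS lets you force a failover by rebooting the instance with the failover flag set. A minimal sketch with boto3, assuming a throwaway instance named my-test-db:

import boto3

rds = boto3.client("rds")

# Reboot the instance and force a failover to the standby.
# Only run this against a test instance: it briefly interrupts connections.
rds.reboot_db_instance(
    DBInstanceIdentifier="my-test-db",  # placeholder identifier
    ForceFailover=True,
)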

Failover Isn’t Always Instant

During the switch from the primary to the standby DB, there’s a brief time when the DB is either unavailable or only allows *read-only* queries. Crazy things can happen if your app doesn’t brace for it.

Imagine someone trying to pay for their order, but the order can’t be saved to the DB because the connection is now read-only for a few seconds. That’s what this story is about.
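
On PostgreSQL, that failure arrives as SQLSTATE 25006. Here's a sketch of what it looks like from psycopg2 (the DSN and table are placeholders):

import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=shop")  # placeholder DSN

try:
    with conn.cursor() as cur:
        cur.execute("INSERT INTO orders (item) VALUES (%s)", ("toaster",))
    conn.commit()
except errors.ReadOnlySqlTransaction:
    # SQLSTATE 25006: the node we reached can't accept writes right now.
    conn.rollback()
    # The right response is a retry with backoff, which we'll build below.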

The Problem: Read-only Transactions at the Worst Moment

Picture an online store. A customer adds a toaster to their cart, proceeds to checkout, adds their address, and clicks “Pay Now.”
Boom! The system tries to save the transaction, but the DB — due to an in-flight failover — returns a cryptic error like:

ERROR: cannot execute INSERT in a read-only transaction

Moments earlier, everything worked fine.

Now, your logs are full of sad errors. Users report checkout failures. But 30 seconds later, the DB goes back to normal.

Why Is This Happening?

  • RDS promoted the standby to become the new writer.
  • The old writer was demoted and, for a brief period, only allowed *read-only* operations.
  • If your app still held a connection to the *old* writer (a pooled connection, or a cached DNS entry for the endpoint), it was trying to write to a read-only DB. Oops.

This situation is rare but real. And if your app doesn’t handle it, it can ruin key business operations in those critical seconds or minutes.
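
Under the hood, the failover works by repointing the RDS endpoint's DNS record at the new primary, so clients holding cached addresses or pooled connections can lag behind for a bit. If you're curious, here's a tiny sketch that watches the endpoint flip during a failover test (the hostname is a placeholder):

import socket
import time

ENDPOINT = "mydb.abc123xyz.us-east-1.rds.amazonaws.com"  # placeholder

# Resolve the endpoint once per second; during a failover you should
# see the IP address change once the DNS record is updated.
while True:
    print(time.strftime("%H:%M:%S"), socket.gethostbyname(ENDPOINT))
    time.sleep(1)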

The Hero of Our Story: Retry and Backoff

What saved the day for our example app? A simple but powerful mechanism: *retry with exponential backoff.*

How It Works

When the DB throws a known error — like “read-only transaction not allowed” — the app says:

“Cool, I’ll chill for a second and try again.”

Instead of crashing or giving up, the app adds a short wait (maybe 200ms), then retries. If it fails again, maybe it waits 400ms. Then 800ms. Then 1600ms… up to a limit.

Eventually, once AWS finishes flipping over the DB writer, the retry succeeds. The transaction goes through. The toaster is sold. Nobody notices anything.

Without retries?
No checkout. Angry emails start flooding your support inbox.

Retry + Backoff = Smooth Recovery

This pattern is widely used in distributed systems to handle flaky services, network hiccups, and — you guessed it — DB failovers.

How to Build a Retry/Backoff Pattern

You can use built-in frameworks or write it yourself. Here’s a checklist to build a good retry layer:

  • Target specific errors: Only retry on known, transient errors like the read-only transaction error.
  • Use exponential backoff: Gradually increase the wait time between retries.
  • Set a retry limit: Don't retry forever. A typical max is 3 to 5 attempts.
  • Add jitter: A bit of randomness in the delays keeps many clients from retrying in lockstep (the thundering-herd problem).

Most ecosystems have battle-tested retry libraries. If you're coding in JavaScript, Python, or Java, check your favorite retry lib before rolling your own; in Python, tenacity covers every item on the checklist above.
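
For instance, here's a minimal sketch with tenacity, reusing psycopg2's read-only error class (save_order and its arguments are hypothetical):

from psycopg2 import errors
from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      wait_exponential, wait_random)

@retry(
    retry=retry_if_exception_type(errors.ReadOnlySqlTransaction),  # target specific errors
    wait=wait_exponential(multiplier=0.2, max=5) + wait_random(0, 0.1),  # backoff + jitter
    stop=stop_after_attempt(5),  # retry limit
    reraise=True,  # surface the original error if every attempt fails
)
def save_order(conn, item):
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO orders (item) VALUES (%s)", (item,))
        conn.commit()
    except errors.ReadOnlySqlTransaction:
        conn.rollback()  # clear the aborted transaction before the next attempt
        raise

One caveat: if the connection itself is pinned to the demoted node, retrying on that same connection may never succeed. Make sure your pool can discard bad connections and reconnect between attempts.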

Some Code Magic

Prefer to write it yourself? Here's a simplified Python example:

import time
import random

def retry_transaction(function, max_retries=5):
    """Run `function`, retrying with exponential backoff on read-only errors."""
    delay = 0.2  # initial backoff in seconds
    for attempt in range(max_retries):
        try:
            return function()
        except Exception as e:
            # Only retry the known, transient failover error; re-raise anything else.
            if "read-only" not in str(e).lower():
                raise
            # Out of attempts? Surface the original error instead of a generic one.
            if attempt == max_retries - 1:
                raise
            time.sleep(delay + random.uniform(0, 0.1))  # backoff + jitter
            delay *= 2  # 0.2s, 0.4s, 0.8s, 1.6s, ...

This wrapper lets your app bounce back from failovers like a champ.
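
Using it is a one-liner. Wrap the write in a lambda (save_order and cart are hypothetical stand-ins for your checkout code):

# Retries the INSERT automatically if a failover makes the DB read-only.
order = retry_transaction(lambda: save_order(cart))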

Real-World Impact: Seamless Checkout

So what happened to the app during the AWS RDS failover?

There was a 15-second window where the DB couldn’t handle write transactions.

But thanks to the retry/backoff logic in the checkout flow, the app handled it gracefully. One sizing note: your retry budget has to cover the failover window. Five attempts starting at 200ms and doubling add up to only a few seconds of patience, so pick the attempt count and delay cap with your expected outage in mind. Shoppers just saw a slightly longer “Processing Payment” spinner. That's it.

No errors, no lost data, *no abandoned carts.*

This tiny piece of resilience code saved real revenue.

Bonus Tip: Monitor for Failovers

Also, it helps to know *when* a failover happens. AWS CloudWatch and RDS Events can alert you. Set up monitoring and logging around these:

  • RDS-EVENT-0013: Multi-AZ failover started (promotion of the standby has begun)
  • RDS-EVENT-0015: Multi-AZ failover complete (DNS may take a few minutes to reach the new primary)
  • CloudWatch metric: DatabaseConnections dropping to zero

With alerts, you’ll have insight the next time this happens — even if your app handles it perfectly.
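
One way to wire this up is an RDS event subscription that publishes failover events to an SNS topic. A sketch with boto3 (the topic ARN and instance identifier are placeholders):

import boto3

rds = boto3.client("rds")

# Push all failover-category events for one instance to an SNS topic,
# which can fan out to email, Slack, a pager, and so on.
rds.create_event_subscription(
    SubscriptionName="db-failover-alerts",
    SnsTopicArn="arn:aws:sns:us-east-1:123456789012:db-alerts",  # placeholder
    SourceType="db-instance",
    EventCategories=["failover"],
    SourceIds=["my-production-db"],  # placeholder
)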

Other Use Cases for Retry/Backoff

It’s not just for RDS. This pattern is useful for:

  • Calling Stripe or PayPal
  • Sending emails through third-party APIs
  • Saving files to S3 or GCS
  • Talking to anything flaky on the network

It’s a humble strategy that turns an app from “brittle” to “battle-tested.”

Takeaways

  • Even cloud-managed services like RDS can have momentary issues.
  • A Multi-AZ failover can leave connections in a brief read-only state.
  • If your app tries to write during that window, those writes will fail unless you're ready.
  • Implementing retry with exponential backoff helps the app stay resilient and user-friendly.

Final Thoughts

Cloud apps live in a bumpy world. Resilience is your secret weapon.
The next time a Multi-AZ failover strikes, you’ll be cool — just watching retries do their thing.

Happy deploying! And, may all your checkouts succeed, even through chaos.