Shut the Front Door–How to Get It Back Open

By: Joey D'Antoni

Published On: November 4, 2025

This week Azure Front Door suffered another major outage. I wrote about the last outage(s) in my column at Redmond just a couple of weeks ago. Azure Front Door is Microsoft’s global content delivery network, and it also provides a number of other services for websites, APIs, and endpoints. One of the challenges with Front Door is that, because it is a global service, there’s no native failover process you can easily use when it goes down.

[Image: closed red wooden door. Photo by Harrison Haines on Pexels.com]

Microsoft has published an initial incident report, and it contains some interesting details.

How did we respond?

15:45 UTC on 29 October 2025 – Customer impact began.
16:04 UTC on 29 October 2025 – Investigation commenced following monitoring alerts being triggered.
16:15 UTC on 29 October 2025 – We began the investigation and started to examine configuration changes within AFD.
16:18 UTC on 29 October 2025 – Initial communication posted to our public status page.
16:20 UTC on 29 October 2025 – Targeted communications to impacted customers sent to Azure Service Health.
17:26 UTC on 29 October 2025 – Azure portal failed away from Azure Front Door.
17:30 UTC on 29 October 2025 – We blocked all new customer configuration changes to prevent further impact.
17:40 UTC on 29 October 2025 – We initiated the deployment of our ‘last known good’ configuration.
18:30 UTC on 29 October 2025 – We started to push the fixed configuration globally.
18:45 UTC on 29 October 2025 – Manual recovery of nodes commenced while gradual routing of traffic to healthy nodes began after the fixed configuration was pushed globally.
23:15 UTC on 29 October 2025 – PowerApps mitigation of dependency, and customers confirm mitigation.
00:05 UTC on 30 October 2025 – AFD impact confirmed mitigated for customers.

Nothing reads as too out of the ordinary for a cloud outage, but a couple of things stand out. First, the service was down for more than eight hours (15:45 UTC to 00:05 UTC). The other notable thing is the 17:26 UTC entry: Microsoft failed the Azure portal away from Front Door. There were some comments about this in the earlier incident report. So that brings up the question: do you need to have a plan to fail away from Front Door?

Do You Need to Be Multi-Cloud?

I talked about this in my Redmond column, but implementing a backup solution for Azure Front Door is inherently a multi-cloud solution. There are a few choices for global WAF solutions, not just from hyperscalers like AWS, Azure, and GCP, but also from Cloudflare. If your application is global, has a low recovery time objective, and is critical to your business, then you need to be multi-cloud.

The bigger question is whether your entire stack needs to be multi-cloud. I would argue that, at least in light of what we know about cloud failures, it probably does not. Unless you have an extremely tight SLA, you are greatly increasing the cost and complexity of the network stack. In fact, I would argue most applications don’t need this kind of highly available network stack.

In designing this I took some lessons from what I think the Azure Portal team has done. I suspect they have their servers behind Application Gateways, with Front Door sitting in front of those gateways.

[Figure: Diagram illustrating a global web content delivery and load balancing architecture involving TM-Failover, FD-Global, AppGW for US West and East regions, and Cloudflare.]

The basic notion is that we use Azure Traffic Manager with priority routing, and the Front Door instance pictured here would be the initial fallback. That gives us some degree of protection against Front Door failures, and that approach seemed to work for the most recent outages. However, there were a lot of downstream DNS issues in other Azure services that raised concerns. For example, you could log in to the portal, but you could see only resource groups and no other resources.
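To make the priority routing idea concrete, here is a minimal Python sketch of the selection logic: return the healthy endpoint with the lowest priority number. This is not Traffic Manager's actual implementation, and the endpoint names and their ordering are my assumptions loosely based on the labels in the diagram, not a real configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str       # hypothetical label, loosely based on the diagram
    priority: int   # lower number means more preferred
    healthy: bool   # outcome of the most recent health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, or None."""
    candidates = [e for e in endpoints if e.healthy]
    return min(candidates, key=lambda e: e.priority, default=None)


# Assumed ordering for illustration only.
endpoints = [
    Endpoint("fd-global", priority=1, healthy=False),   # Front Door is down
    Endpoint("appgw-us-east", priority=2, healthy=True),
    Endpoint("cloudflare", priority=3, healthy=True),
]

chosen = pick_endpoint(endpoints)
print(chosen.name if chosen else "no healthy endpoint")
```

With the first endpoint marked unhealthy, traffic goes to the next priority in line. If every endpoint fails its probe, the function returns None, which is the case you need a plan for.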

Cloudflare comes into play here, presuming you can’t make any app updates or your app gateways go sideways. You could recreate all of your Front Door functionality and have easy failover in what is effectively a completely different provider. That doesn’t help if Azure were to go completely down, but we haven’t seen an outage like that since the great certificate expiration failure of 2013. Generally speaking, failures are limited to regions. These Front Door outages are exceptions to that rule, as Front Door is a “global” service and isn’t homed to a single region (which makes the recent outages more infuriating).

Traffic Manager being in Azure is a concern to me. I put it into this architecture because it’s relatively easy to configure, but using another cloud for DNS could be a good option. Both Google (Cloud DNS) and AWS (Route 53) have services that allow for multiple IP addresses and failover based on health probes, or you could use a service like DNSMadeEasy to handle this. DNS is really the ultimate challenge in any sort of multi-cloud scenario: where do you put it?
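As a rough illustration of health-probe-based DNS failover, the Python sketch below tracks consecutive probe failures per endpoint and answers with the first record that is still considered healthy. It is a generic model, not the behavior of any specific DNS provider. The class and function names, the failure threshold, and the IP addresses (taken from the reserved documentation ranges) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


class ProbeState:
    """Tracks consecutive probe failures for one endpoint."""

    def __init__(self, tolerated_failures: int = 3) -> None:
        self.tolerated_failures = tolerated_failures
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> None:
        # A single success resets the counter; each failure increments it.
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        # Unhealthy only once failures exceed the tolerated count.
        return self.consecutive_failures <= self.tolerated_failures


def resolve(records: List[Tuple[str, str]],
            states: Dict[str, ProbeState]) -> Optional[str]:
    """Return the IP of the first healthy record in failover order, or None."""
    for name, ip in records:
        if states[name].healthy:
            return ip
    return None


# Hypothetical records in failover order, using documentation IP ranges.
records = [("azure-front-door", "192.0.2.10"), ("cloudflare", "198.51.100.10")]
states = {name: ProbeState(tolerated_failures=1) for name, _ in records}

states["azure-front-door"].record(False)  # one failure is tolerated
states["azure-front-door"].record(False)  # second failure exceeds the threshold
print(resolve(records, states))
```

Keep in mind that clients and resolvers cache answers for the record's TTL, so the TTL you choose sets a floor on how quickly a DNS-based failover takes effect.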

There’s a lot more detail here than in a normal blog post, and yet I held back a lot of detail. There’s a method to my madness: I’ll be publishing a white paper and doing a webinar with my friends from Denny Cherry and Associates Consulting, John Morehouse and Denny Cherry, to discuss pros and cons and detailed configurations for making your applications more resilient when the Front Door closes, as they say. Look for more details on that over at dcac.com/ in the next few weeks.
