Microsoft Fabric–Why Are You So Down?

Microsoft Fabric is a software-as-a-service platform for data processing, business intelligence reporting, and even online transaction processing apps. Fabric has been generally available since November 2023. Building planet-scale platforms is hard, and I have a great deal of sympathy for engineers and architects who are dealing with layoffs, constant pressure for new features, and the push to somehow incorporate AI into every piece of every platform.

All that being said, customers buy online services and expect them to be available. One of the reasons a company chooses Fabric, Databricks, or Snowflake is the notion that those platforms for Spark and various data warehousing options will be secured, patched, and better maintained than a non-technology company could do by simply deploying Spark into Kubernetes or VMware. With that, the cloud providers have an obligation to deliver services to their customers, and deliver availability and performance congruent with their pricing.

One of the things I expect from a cloud provider is honest post-mortems when they have an outage, and maintaining a history of their outages. These histories help architects better design systems, as we can better identify weaknesses in various cloud services that we might want to design around. Azure and AWS both do an excellent job of providing detailed information around “what happened” in incidents.

Cloud providers service customers all over the world, and should feel an obligation to provide detailed incident reports. Azure and AWS are also backed by financially backed service level agreements (SLAs), which means customers get refunded when outages exceed the SLA of a service. For example, Azure SQL Database has a 99.995% SLA for Business Critical. If your Azure SQL DB is unavailable for more than roughly 2 minutes in a month, you’re due a refund (this also applies to data loss or slow failover events). This isn’t business continuity insurance: Microsoft doesn’t pay you for lost sales during the outage, but it’s better than nothing.
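To make the "roughly 2 minutes" figure concrete, here is a small sketch of the arithmetic. It assumes a 30-day month and treats the SLA as a simple uptime percentage; the function name is mine, and the real SLA terms define downtime and refunds in more detail than this.

```python
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted before the SLA is breached.

    Assumes the SLA is a plain uptime percentage over the whole month.
    """
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - sla_percent / 100)


# Compare a few common SLA tiers, including the 99.995% mentioned above.
for sla in (99.9, 99.99, 99.995):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {minutes:.2f} minutes of downtime per month")
```

At 99.995%, that works out to about 2.16 minutes in a 30-day month, which is where the figure above comes from.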

Fabric, on the other hand, doesn’t have a dedicated SLA. While there is a service reliability documents page, customers have extremely limited options for cross-regional services. There isn’t a simple “geo-replicate my Fabric environment” option that you can enable for any amount of money. Microsoft also has a separate Fabric status page, which maintains a short status history (it seems to go back a week or so) but doesn’t track historical incident reports like the Azure page does. There was a global Fabric outage in May, yet there’s no public-facing page that details it. I can still see it in my Microsoft 365 admin center, but that only keeps 30 days of history and it’s about to roll off. Additionally, the status page for Fabric doesn’t seem to retain the history of all of the service health events. Thanks to u/Fabric-Status on Reddit, I was able to capture all of the service degraded events in June.

Date           Services Impacted                        Regional Scope
June 23, 2026  Power BI Labeling                        Global
June 23, 2026  Data Pipelines                           UK South
June 22, 2026  Power BI (Narrow)                        Brazil South
June 20, 2026  Copilot/ML                               UAE North
June 17, 2026  Power BI Downloads                       Global?
June 14, 2026  Fabric Control Plane                     East US 2
June 12, 2026  Activator                                India Central
June 10, 2026  Copilot/ML                               North Central US
June 10, 2026  Spark Jobs, Notebooks, Python/R Visuals  East US 2
June 4, 2026   Power BI Refresh                         Global

That’s quite a list, and they aren’t all in the Fabric status portal. They were all there at one time, but they weren’t archived, and we don’t have detailed incident reports on any of them. The May outage was global and users literally couldn’t access Fabric, and all we know is that it was a DDoS attack, with no additional details.

I know this comes off as old man yells at cloud. But this is an expensive service that customers use globally, that has had 12 service degradations in the last month, and customers don’t have a root cause on any of them. This is simply unacceptable behavior from any service provider.

There was a time in 2018-2019 when Azure was a mature platform, but was also suffering from a lot of service reliability issues. One of these involved a data center being struck by lightning, but there were also a major Azure AD (now Entra) outage and DNS problems that had widespread impact (including some very limited Azure SQL DB data loss). After all this happened, Microsoft made major investments in improving the reliability of the Azure platform as a whole. Nothing is perfect, and there are still Azure outages, but the response and the reporting process make them fully transparent to customers.

I feel like every time I’m writing about Fabric I’m yelling at the leadership team to make Fabric more like Azure. That’s probably accurate. Azure is a robust, mature platform that supports enterprise controls, auditing, policy, reliability and security. There are still parts of Azure that suck (anyone know how to monitor usage on a business critical readable secondary?), but overall I have faith in the platform, because of how open Microsoft have been when they have failures.

Fabric needs to be more robust, more transparent, and more reliable. Given the scope of most of these outages, I suspect they relate to the cadence of software releases. New features are cool, but services that go offline because a new feature release went bad are really uncool. Not having an outage history, or options to make your Fabric environment more fault tolerant, is even less cool. Microsoft should do better by their customers. Fixing this includes a few things:

  • A persisted outage history page with detailed incident reports
  • Service Level Agreements for every component in Fabric and their expected availability
  • The ability to monitor Fabric from outside of Fabric
  • Simply better reliability—the current state is unacceptable
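
Until something official exists, the third item can only be roughly approximated with a probe that runs somewhere that doesn’t depend on Fabric. This is a minimal sketch, not a real Fabric health check: the names are mine, the URL in the usage comment is a placeholder, and a reachable endpoint doesn’t prove that refreshes, pipelines, or Spark jobs are actually working.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple, Optional


class ProbeResult(NamedTuple):
    ok: bool                # True when the endpoint answered with a 2xx/3xx status
    status: Optional[int]   # HTTP status code, or None if there was no response
    seconds: float          # wall-clock time spent on the attempt


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Make one attempt against url and record the outcome and latency."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:  # DNS failure, refused connection, timeout
        return ProbeResult(False, None, time.monotonic() - start)
    return ProbeResult(200 <= status < 400, status, time.monotonic() - start)


# Hypothetical usage, run on a schedule from outside the platform:
#   result = probe("https://example.com/your-endpoint")  # placeholder URL
#   print(result)
```

Logging each result with a timestamp gives you your own outage history, which is exactly what the status page doesn’t keep.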

Some of this is easy: the outage history page should happen yesterday. And put it on the Azure page, because the fact that the DDoS report was trapped behind an M365 login was frustrating. The SLAs are something customers should demand. The other two things are harder, but every other important system has them. Reddit shouldn’t be your outage reporting system.
