I lead the engineering organization responsible for Office 365. My team builds, operates and supports our Office 365 service, and over the last few days, we have not satisfied our customers’ needs. On Thursday, November 8, and today, November 13, we experienced two separate service issues that impacted customers served from our data centers in the Americas. Both issues have been resolved and the service is now running smoothly. These incidents were unique to the Office 365 Exchange Online mail service and were not related to any other Microsoft services.
I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused. We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted. We will be proactively issuing a service credit to our impacted customers.
I also want to provide more detail about the recent issues.
The first event occurred on November 8th from 11:24AM to 7:25PM PST. This service incident resulted in prolonged mail flow delays for many of our customers in North and South America. Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. One of these engines identified a virus being sent to customers, but the engine began to exhibit high latency as it handled the messages. To compound the issue, our service was configured to allow too many retries and too long a timeout for these messages. Because a flood of these specific emails reached some of our service capacity units, this improper handling caused a significant backlog of valid email in those units. We resolved the issue by deploying an interceptor fix that sends the offending messages directly to quarantine. Going forward, we are instituting several additional layers of defense. In addition to fixing the engine handling, we have instituted more aggressive thresholds for deferring problem messages. We have also built and implemented better recovery tools that allow us to remediate these situations much faster, and we are adding architectural safeguards that automatically remediate issues of this general nature.
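To illustrate the kind of threshold described above, the following Python sketch caps the number of scan attempts and the per-attempt timeout, then diverts a message to quarantine instead of letting it hold up the queue. It is only an illustration under stated assumptions: `scan_with_limits`, the `scan` callable, and the default values are hypothetical and do not reflect the actual Office 365 configuration.

```python
def scan_with_limits(message, scan, max_attempts=2, timeout_seconds=5.0):
    """Decide what to do with a message under a bounded scanning budget.

    `scan` is a hypothetical callable that returns True when the message is
    infected and raises TimeoutError when the engine does not answer within
    `timeout_seconds`. The defaults are placeholders, not real settings.
    """
    for _ in range(max_attempts):
        try:
            infected = scan(message, timeout_seconds)
        except TimeoutError:
            # The engine was too slow; retry, but only up to the cap.
            continue
        return "quarantine" if infected else "deliver"
    # Attempts exhausted: set the problem message aside rather than
    # letting it delay valid mail queued behind it.
    return "quarantine"
```

With a low attempt cap and a short timeout, one slow engine costs each problem message a bounded amount of time, so valid mail behind it keeps flowing.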
From 9:08AM to 2:10PM PST today, November 13th, some customers in North and South America were unable to access email services. The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service. This morning, the Office 365 team was performing planned, non-impacting network maintenance by shifting some load out of the datacenters under maintenance. In combination with this standard process, we experienced a ‘gray’ failure of some active network elements; the elements failed, but did not alert us to their failure. Additionally, load on the service has been increasing as more customers on-board. These three issues in combination caused customer access to email services to be degraded for an extended period of time. By 10:42AM PST, remediation work was underway to balance users to healthy sites, broaden the service access points and remediate the failed network devices. At 2:10PM PST all services were fully restored. A significant capacity increase is already well underway, and we are also adding automated handling for these gray failures to speed recovery time. Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.
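A gray failure, as described above, is an element that has stopped working without raising an alert. One common way to catch such failures is to probe each element actively rather than rely on its own alerts. The Python sketch below shows that idea only; the function name, the `has_alerted` and `probe` callables, and the attempt count are hypothetical and are not taken from the Office 365 implementation.

```python
def find_gray_failures(elements, has_alerted, probe, attempts=3):
    """Return elements that raised no alert yet failed every active probe.

    `has_alerted(element)` and `probe(element)` are hypothetical callables
    that return True when the element reported a failure, or answered the
    probe, respectively.
    """
    suspects = []
    for element in elements:
        if has_alerted(element):
            # An ordinary failure: already visible to operators.
            continue
        # Probe several times so one dropped probe is not treated as a failure.
        if not any(probe(element) for _ in range(attempts)):
            suspects.append(element)
    return suspects
```

Elements returned by such a check could then be taken out of rotation automatically, which is one possible form of the automated handling mentioned above.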
As I’ve said before, all of us in the Office 365 team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable. I want to assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.
As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or contact our customer support team. Our customer support is available 24 hours a day via Service Requests submitted from the Office 365 Portal.
Corporate Vice-President, Microsoft Office Division