Update on recent customer issues…

I lead the engineering organization responsible for Office 365.  My team builds, operates and supports our Office 365 service, and over the last few days, we have not satisfied our customers' needs.  On Thursday, November 8, and today, November 13, we experienced two separate service issues that impacted customers served from our data centers in the Americas.  Both issues have been resolved and the service is now running smoothly. These incidents were unique to the Office 365 Exchange Online mail service and were not related to any other Microsoft services.

I'd like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused.  We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted.  We will be proactively issuing a service credit to our impacted customers.

I also want to provide more detail about the recent issues.

The first event occurred on November 8th from 11:24AM to 7:25PM PST.  This service incident resulted in prolonged mail flow delays for many of our customers in North and South America.  Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers' inboxes. One of these engines identified a virus being sent to customers, but the engine started to exhibit high latency as it handled the messages.  To compound the issue, our service was configured to allow too many retries and too long a timeout for these messages.  Given the flood of these specific emails to some of our service capacity, this improper handling caused a significant backlog of valid email in those capacity units.  We resolved the issue by deploying an interceptor fix to deal with the offending messages and send them directly to quarantine.  Going forward, we are instituting multiple further levels of defense. In addition to fixing the engine handling, we have now instituted more aggressive thresholds for deferring problem messages.  We have also built and implemented better recovery tools that allow us to remediate these situations much faster, and we are adding further architectural safeguards that automatically remediate issues of this general nature.
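To illustrate the kind of retry and timeout budget described above, here is a minimal sketch. It is not the Office 365 implementation; the names `route_message`, `ScanTimeout`, `max_attempts` and `timeout_seconds` are hypothetical, and the scan function is assumed to raise `ScanTimeout` when the engine does not answer in time.

```python
class ScanTimeout(Exception):
    """Raised by a scan function when the engine does not answer in time."""


def route_message(message, scan, max_attempts=2, timeout_seconds=5.0):
    """Decide what to do with one message under a small retry budget.

    `scan(message, timeout_seconds)` is a hypothetical callable that returns
    True if the message is infected, False if it is clean, and raises
    ScanTimeout if the engine is too slow.
    """
    for _ in range(max_attempts):
        try:
            infected = scan(message, timeout_seconds)
        except ScanTimeout:
            # The engine was too slow; spend one unit of the retry budget.
            continue
        return "quarantine" if infected else "deliver"
    # Budget exhausted: set the message aside so it cannot hold up
    # the valid mail queued behind it.
    return "defer"
```

The point of the sketch is that a message which repeatedly stalls the scanner is deferred after a small, fixed number of attempts instead of occupying capacity that valid mail needs.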

From 9:08AM to 2:10PM PST today, November 13th, some customers in North and South America were unable to access email services.  The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service.  This morning, the Office 365 team was performing planned non-impacting network maintenance by shifting some load out of the datacenters under maintenance.  In combination with this standard process, we experienced a 'gray' failure of some active network elements; the elements failed, but did not alert us to their failure.  Additionally, we have an increasing load of customers on-boarding to the service.  These three issues in combination caused customer access to email services to be degraded for an extended period of time.  By 10:42AM PST, remediation work was underway to balance users to healthy sites, broaden the service access points and remediate the failed network devices.  At 2:10PM PST all services were fully restored.  A significant capacity increase was already well underway, but we are also adding automated handling of these gray failures to speed recovery time.  Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.
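One common way to catch a 'gray' failure of the kind described above is to probe each element from the outside instead of trusting its own status. The sketch below assumes that approach; it is not the Office 365 tooling, and `find_gray_failures`, `probe` and `probes_per_device` are hypothetical names.

```python
def find_gray_failures(devices, probe, probes_per_device=3):
    """Return devices that claim to be healthy but fail every active probe.

    `devices` maps a device name to its self-reported health (True/False).
    `probe(name)` is a hypothetical callable that returns True when a real
    request through the device succeeds.
    """
    gray = []
    for name, self_reported_healthy in devices.items():
        all_probes_failed = not any(
            probe(name) for _ in range(probes_per_device)
        )
        # A device that reports its own failure is an ordinary failure;
        # a gray failure is one the device itself does not report.
        if self_reported_healthy and all_probes_failed:
            gray.append(name)
    return gray
```

Requiring several failed probes before flagging a device is a design choice that trades a little detection delay for fewer false alarms from a single dropped request.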

As I've said before, all of us in the Office 365 team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business - that's not acceptable.  I want to assure you that we are investing the time and resources required to ensure we are living up to your - and our own - expectations for a quality service experience every day.

As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or contact our customer support team.  Our customer support is available 24 hours a day via Service Requests submitted from the Office 365 Portal.

Rajesh Jha

Corporate Vice-President, Microsoft Office Division

Comments: (32)

  • I truly appreciate the transparency reflected in this post. I hope it is perpetuated throughout Microsoft. O365 is a journey that we are in together and this transparency makes me as a customer feel more like a partner in this journey.

  • Hi Rajesh, with all due respect to your last paragraph, the Service Health Dashboard is the absolute last place you want to look when you see issues.  In the case with the last 3 outages, the portal did not report any issue or problems until well over an hour, in one case 2 hours after we started experiencing the problem.  MS can do a much better job of reporting problems to customers.  Customers are wasting valuable time troubleshooting problems that MS is already aware of but just has not reported yet.  I'm sure that MS was aware of the issue because when I called the 800 support number the technician was already aware of the problem.  If they are already aware of the problem, then put it on the portal and save us the 30-60 minutes of waiting on the phone!  Thanks.

  • As a re-seller, I would like a place I could check the status (if promptly reported, as Keane mentions, it isn't always) that I don't have to log in.  I am subscribed to the "Office 365 Service Health RSS Notifications" RSS feed - but every time I want to view the article for more details, I am stuck having to log in.  It would be far more convenient if I could quickly pull some information to my customers calling in wondering if it's a problem with the service, or something more local to their network or PC.

  • I appreciate the fact that you have provided an update with an actual explanation. However, that doesn't excuse the terrible communication and process failure that occurred during the outage. In addition to addressing the technical issues that caused both of these outages, Microsoft also needs to address the monitoring and communication processes.

    Also, your times on the November 13th outage are incorrect, as I was experiencing mail problems as early as 8:45am PST; in fact I reported it to Microsoft before 9AM--and our engineers were on hold with premier support for over 15 minutes by then as the support queue was overwhelmed. I'm curious to know how you identified 9:08 as the start of the incident.

  • I agree with the previous commenters - I was able to gather more information that there was a problem over 2 hours before Microsoft reported that they started investigating by following #Office365 on Twitter. I think it would have saved people a lot of time and trouble if MS would use the social media more effectively to communicate issues than relying on your current reporting system.

  • This is a great post.  The recognition of 'gray' failure is intriguing.  

    Thanks for the insight into how massive scaling of connected systems requires us to take off the blinders of digital certainty and give serious attention to the contingent reality.  We need more accounts like this.  

    Also, I commend the trustworthiness that is exhibited by the care reflected in your account and how the breakdown is dealt with and accounted for.

    It is intriguing that the November 2012 issue of Communications of the ACM that I just read through features system resiliency and, not by name, 'gray failures' in file systems.

  • PS: I see by the earlier comments that another advantage to this level of transparency is gaining valuable feedback on how to improve notification and providing ways of easily knowing system status.  That's great to see.

  • Mr. Jha, I have to agree with the other comments that it's nice to have the dashboard, but it's nearly useless when there is a massive problem and everything is green for over an hour.  Even if it were just a "huh, we just had a 200% spike in calls for Exchange in the past 10 minutes; maybe we should at least put up the 'There might be a potential problem'" just so there's not as many of us flooding the call center, but as a fellow engineer, the first step after identifying there is a problem should be notifying of it.

    Also, is the "increasing load of customers on-boarding to the service" related to the US VA announcement earlier that day?  I sure hope y'all can deal with growth; otherwise this is going to be a limiting issue for O365.

    Finally, what sort of network issue that was "grey" from 9:08 to 10:42 results in a 3+ hour *outage* for everyone as a 'failover'?  You really need to examine your failover triggers and determine thresholds against the remediation time.

    Food for thought.

  • I would agree that historically the Health Dashboard is way behind the actual event (almost 90 minutes on 11/13).  Then it was updated twice with this message: "A few users are unable to access their email at this time."  Both incidents used the words "a few users".  When you have an issue, it rarely will affect only a few users.

    It's hard not to take issue with or "wordsmith" the message, but it seems like the more appropriate words would have been "some users" or "many users".  "A few users" makes it sound like a very minor issue.  Ninety minutes in, it was hard to believe it was only a few users.  We are professionals and are held accountable for our decisions.  One of those decisions was choosing Exchange Online.  We deserve answers and updates we can work with, not a notice that a few users might be having issues.

  • Communication during a service incident or outage is critically important. We (like others here) have clients contacting us immediately trying to figure out what is going on (is it their computer, ISP, Internet, MS?). It was great to find this detailed explanation here, but I was disappointed that #Office365 on Twitter only shows 'The issue has been resolved and the service is now restored' at 4PM PST on Tuesday. A link back to this report would have been helpful and would have allowed us to provide a more technical explanation regarding the issue.

  • I am amazed you are holding to falsehoods of the issue starting at 9:08 am, which matches the 12:08 pm Eastern on the web site for the start of failures.  Failures really started about 10:30 am Eastern Time for my first call.

    After checking the site around 10:45 and then spending another 45 minutes tracking the problem, I called the 800 number (about 11:30 am Eastern) and received a message that there was an issue and to log a request on the web site.

    Of course, being obedient, I went to the web site, and it said I should check the status.  The status was still not listing a problem, though it was obvious there was an issue.  You wasted 90 minutes of my time because of your lack of professionalism.

    Now I come here and what I see is the same misinformation continued.  My complaint is about communication, not about having a problem.  I am glad to see so many others offended by this.  Maybe you will get a clue how important this is to those who really support the end users.

  • Great, detailed note.  Stellar to see ownership and responsibility and commitment to make this improve.  One area to improve: the SHD doesn't display issues, or perhaps it's better to say it has too big a bias to show GREEN when it's really Yellow/Red, as I've seen it "green" a couple of times when the general outages you reference were going on.  Yesterday around 11 it showed green, for example.  With public acknowledgement like this, I am sure this will get fixed, and smart engineers are digging into it right now :)

  • My name is Morgan Cole, and I'm a Director in our Customer Experience team for Office 365.  First, thanks for taking the time to write up the comment and feedback on our service communication.  As Rajesh states above, we're doing a post-mortem of the issue, which will include a detailed walkthrough of the customer impact and timing of our communications.  We value your perspective as it will help us to hone our ability to respond with appropriate speed.  We understand that it is critical for our customers to be as fully informed as possible during a service incident, and it is a consistent goal for us to continue to improve the timeliness and specificity of our communications.  While the primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard, we are also always trying to improve various other channels such as Twitter and our own community.  Again, many thanks for taking the time to post a reply and help us to gain greater insight on customer experience improvements.

  • That is a great piece of feedback, particularly related to an easier method for service incident notification.  The opportunity we have with a service is to institute some of these capabilities as part of regularly-scheduled updates, and we will definitely take your feedback into our resourcing and feature planning.  Thanks for taking the time to post a reply.

  • Hi Howard

    Please see my reply above to Mr. Grivich.  We will definitely evaluate what seems to be a miss regarding our service communication timeliness as part of the service incident post mortem.  Thanks for taking the time to provide your perspective.
