Update on recent customer issues…

I lead the engineering organization responsible for Office 365.  My team builds, operates and supports our Office 365 service, and over the last few days, we have not satisfied our customers' needs.  On Thursday, November 8, and today, November 13, we experienced two separate service issues that impacted customers served from our data centers in the Americas.  Both issues have been resolved and the service is now running smoothly.  These incidents were unique to the Office 365 Exchange Online mail service and were not related to any other Microsoft services.

I'd like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused.  We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a post mortem, and will also provide additional updates on how our service level agreement (SLA) was impacted.  We will be proactively issuing a service credit to our impacted customers.

I also want to provide more detail about the recent issues.

The first event occurred on November 8th from 11:24AM to 7:25PM PST.  This service incident resulted in prolonged mail flow delays for many of our customers in North and South America.  Office 365 utilizes multiple anti-virus engines to identify and clean virus-infected messages from our customers' inboxes.  One of these engines identified a virus being sent to customers, but the engine started to exhibit high latency as it handled the messages.  To compound the issue, our service was configured to allow too many retries and too long a timeout for these messages.  Given the flood of these specific emails to some of our service capacity units, this improper handling caused a significant backlog of valid email in those units.  We resolved the issue by deploying an interceptor fix to deal with the offending messages and send them directly to quarantine.  Going forward, we are instituting multiple further levels of defense.  In addition to fixing the engine handling, we have now instituted more aggressive thresholds for deferring problem messages.  We have also built and implemented better recovery tools that allow us to remediate these situations much faster, and we are adding architectural safeguards that automatically remediate issues of this general nature.
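For readers who want a concrete picture of what "fewer retries, a shorter timeout, and deferral to quarantine" can look like, here is a minimal Python sketch. It is an illustration only and not the Office 365 implementation: the names `ScanPolicy`, `MailPipeline`, `ScanTimeout`, and the `scan` callback are all hypothetical, and the limits shown are arbitrary assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ScanTimeout(Exception):
    """Hypothetical error: the scanner did not answer within the timeout."""


@dataclass
class ScanPolicy:
    # Assumed values for illustration; real thresholds are not given in the post.
    max_attempts: int = 2
    timeout_seconds: float = 5.0


@dataclass
class MailPipeline:
    # scan(message, timeout_seconds) returns True if the message is clean,
    # False if it is infected, and raises ScanTimeout if the engine is too slow.
    scan: Callable[[str, float], bool]
    policy: ScanPolicy = field(default_factory=ScanPolicy)
    delivered: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)

    def handle(self, message: str) -> str:
        """Scan a message a bounded number of times, then defer it."""
        for _ in range(self.policy.max_attempts):
            try:
                clean = self.scan(message, self.policy.timeout_seconds)
            except ScanTimeout:
                continue  # try again, but only up to max_attempts
            if clean:
                self.delivered.append(message)
                return "delivered"
            self.quarantined.append(message)
            return "quarantined"
        # Every attempt timed out: send the message to quarantine instead of
        # retrying indefinitely and holding up the valid mail queued behind it.
        self.quarantined.append(message)
        return "quarantined"
```

The design point is that a message that repeatedly stalls the scanner consumes at most `max_attempts * timeout_seconds` of scanning time before it is set aside, so a flood of such messages cannot build an unbounded backlog.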

From 9:08AM to 2:10PM PST today, November 13th, some customers in North and South America were unable to access email services.  The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service.  This morning, the Office 365 team was performing planned non-impacting network maintenance by shifting some load out of the datacenters under maintenance.  In combination with this standard process, we experienced a 'gray' failure of some active network elements; the elements failed, but did not alert us to their failure.  Additionally, we have an increasing load of customers on-boarding to the service.  These three issues in combination caused customer access to email services to be degraded for an extended period of time.  By 10:42AM PST, remediation work was underway to balance users to healthy sites, broaden the service access points and remediate the failed network devices.  At 2:10PM PST all services were fully restored.  A significant capacity increase is already well underway, and we are also adding automated handling of these gray failures to speed recovery time.  Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.
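To illustrate the idea of a 'gray' failure, the short Python sketch below flags elements that report themselves as healthy while failing active probes. It is a simplified illustration under stated assumptions, not the actual Office 365 tooling: `NetworkElement`, `find_gray_failures`, the probe count, and the success threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NetworkElement:
    name: str
    self_reported_healthy: bool   # what the element's own alarms claim
    probe: Callable[[], bool]     # active end-to-end check; True on success


def find_gray_failures(
    elements: List[NetworkElement],
    probes_per_element: int = 5,      # assumed value for illustration
    min_success_ratio: float = 0.8,   # assumed value for illustration
) -> List[str]:
    """Return names of elements that claim to be healthy but fail probes."""
    gray = []
    for element in elements:
        successes = sum(1 for _ in range(probes_per_element) if element.probe())
        ratio = successes / probes_per_element
        # An element that raises its own alarm is an ordinary, visible failure.
        # The gray case is the one that stays silent while traffic fails.
        if element.self_reported_healthy and ratio < min_success_ratio:
            gray.append(element.name)
    return gray
```

An automated handler could then drain traffic away from the returned elements without waiting for an alert that, by definition, never arrives.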

As I've said before, all of us in the Office 365 team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business - that's not acceptable.  I want to assure you that we are investing the time and resources required to ensure we are living up to your - and our own - expectations for a quality service experience every day.

As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or contact our customer support team.  Our customer support is available 24 hours a day via Service Requests submitted from the Office 365 Portal.

Rajesh Jha

Corporate Vice-President, Microsoft Office Division

Office Blogs Comments

  • Thanks for the feedback.  As stated above, while the primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard, we are also always trying to improve various other channels such as Twitter and our own community.  Twitter is an excellent mechanism to provide a notification, and we will continue to spend time improving our speed of response on these kinds of issues.

  • The health dashboard is useless. Time to throw it out and start over with real-time stats and statuses.

  • Dear Rajesh Jha, although the service outage had an impact on the organization ORIGENES SEGUROS DE RETIRO, I felt fully supported by Microsoft Argentina, Perception Group (partner) and Microsoft Int'l. Hoping that these problems do not recur, for the good of everyone, I send you my regards.

  • Something is very wrong with the service health portal. It seems to take the Office365 user forum boiling over before the service health is updated. And then we see 'some', 'few', 'potential issue'. That comes after dozens of notes showing up reporting the problem. My company switched to Office365 on 9/22 and since then has had nearly a half dozen outages of some type. When can we expect an improvement to the service health display and reliability?

  • Rajesh,  Either your people are lying to you or you are lying to us. This outage started before 9:00 am Pacific. We started fielding user complaints at 6:30 am Pacific. I'm guessing our SLA reimbursements will be based on this inaccurate start time. If your people had updated the service health dashboard properly, I could have saved a lot of time troubleshooting my ADFS environment. Lastly... I am offended by the statements your staff reported that a "few users" were impacted. At least two Exchange clusters were impacted. A few thousand users were impacted... is my best guess. I can deal with an outage. What really pisses me off is being lied to and having my vendor downplay the severity!

  • Hello Rajesh and Office 365 team,

    This post is helpful. Here are some follow-ups that would greatly benefit us and help us communicate to our end users to increase their and our confidence in Office 365:

    1. Are there threshold limits for the number of customer calls raised before you recognize and acknowledge an issue and send out a blast to your Office 365 customer system admins notifying us about the outage? For example, there were 3 or 4 IT resources spending 2-3 hours on each incident, troubleshooting this problem. That effort could be saved if we knew about the issue, got a broadcast, and could in turn notify our internal customers. Please also send that email notification to a non-Office 365 account.

    2. Since this was a significant outage affecting several customers, will you be hosting a LIVE meeting and answering questions submitted before or during the conference call?

    3. There is not a single place either on the blog site or the Portal to check the status of the system unless you are an administrator. And I heard from my admins that there was no blog post or notification until much later. Would you be able to publish the metrics, or at least outages, in a timely manner in multiple places (blog, portal, etc.)?

    4. Do you have tools to monitor the particular servers affected and to let us know the specific user population facing an issue? For the Nov 13th outage, it seems like only folks on specific servers were affected. And from what we can tell, our user base is spread across about 100 servers.

    Thanks much,

    Arul Daniel, Jim Kelly & Scott Chinn

    Amyris IT Team

  • Hi John

    Not Rajesh, but I thought I would reply to your post.  Thanks for taking the time to comment.  Based on the feedback from this comment blog and in the communities, we definitely are reviewing policy and procedures on posting to the Service Health Dashboard, and the speed at which we do that to represent accurately the situation with the service.  We're convinced that we've found root cause and remediated the issue, but remain vigilant for any possible service impact.

  • I would like to reiterate the poor communication Microsoft provides with the Office365 offering from an operational standpoint.  The dashboard is never updated in a timely manner, nor does it contain accurate times for the outage.  This is not an isolated incident but has become commonplace.  I've grown tired of the constant apologies and need to see real change.  If my customer service, communications and sense of urgency were as poor as those of the O365 operations team, I would be unemployed.

  • Hi Karen

    This is great feedback that we will consider for future service incidents that warrant proactive Twitter posts.  I appreciate you taking the time to provide your perspective.  One of the challenges to coherent communication is to coordinate all of the different communication channels in the midst of a significant service incident, particularly to match the level and depth expected from customers and partners.  We try to use the Twitter handle as an 'emergency broadcast system' to alert the broadest set of community members to a potential issue.  Along with that, we try to maintain an active presence in our community forums and update the Service Health Dashboard regularly.  As others have noted, we need improvement in the timeliness of our SHD updates, and we'll work hard in future events to fulfill that requirement.  We definitely recognize that it's critical to keep customers and partners well-informed in the case of any service issue, so we'll take your feedback as concrete recommendations on ways to improve.

  • Morgan,

    Following @Office365, really?

    The amount of information on Twitter was insufficient and not timely. But what is worse is that the marketing machine of Microsoft uses that moniker to advertise, so the signal-to-noise ratio is inadequate to serve as a telegraph of a problem. Three tweets over the duration of the outage is hardly overwhelming anyone with actionable information.

    During an outage, we the management of the companies that have entrusted Microsoft to run this environment, need real time information that we can use to manage the outage within our respective companies. Unfortunately, that’s not the first time you heard that from me.

    -J

  • Hello Rajesh and the Office365 team.

    First let me say that I am truly sad to say that, for me as a Partner with Microsoft for a very long time, it makes me very irritated when Microsoft offers explanations by way of compensation for what we are calling the “BLACK WEDNESDAY INCIDENT” of Office365.

    I can truly tell you just don’t get it. As a result, the clients I work for, who pay my bills, will be leaving Office365 very soon.

    1) Again, just to review: any outage that results in loss of communication should mean a complete credit of the month of service or better. Microsoft needs to feel it. Partners and clients in this business are at a loss of more than double because of the time spent trying to find the problem and then working out a fix for the issues caused by it. Maybe the IT at Microsoft should think of it like a person on a breathing machine: if the machine stops for just a few minutes he dies. No outage is acceptable. This is MICROSOFT; you can’t tell me you don’t have the resources to have the Exchange Server do a confidence check for the health of the system, then warn us of trouble, and then apply a fix.

    2) You need to understand that as partners we deal with the clients every day. If there is a problem we fix it. That means a workaround is put in place that minute instead of reasons for the outage several days later. Don’t get me wrong, it’s great as an IT person to know why the problem happened. But because of the outage, and many more brown outages not spoken about, many of my clients are already going to another provider for the service, as they needed some way to communicate. In my line of work that means a loss of revenue and of confidence in my ability to do business. If the outage is treated as any less than that, then you just don’t get it. Please think of it like you were supplying air to breathe.  A message to a company trading stock, a message for a notary to go to a loan signing, or a message from a client telling me of an issue with Office365 that does not get seen because of the outage will not be accepted.

    Wake up MICROSOFT this was yours and my “BLACK WEDNESDAY INCIDENT”.

    Because this message comes from my clients!  Remember that.

    Sadly,

    Steve Berman

    Quick’n Computer Service

    503-679-8882 Is my Cell

    steve@qcsl.biz  via 1and1.com

    steve@quickncomputer.com  via Office365

    Which account do you think my clients will email me on?

    It’s just that simple.

  • The transparency is couched in spin: "we have not satisfied our customers' needs" means that there was no stinkin' way to get your email, and on the east coast, it came back up after you left the office.

    The only question about meeting Service Level Agreements is whether there will be another outage before the end of the month.

  • Rajesh,

    Thanks for the update, but you seriously need to spend this weekend looking at how other large organizations communicate with customers during an outage.  Many of us are MSPs and we had absolutely zero information to give our customers.

    Go look at the tremendous job utilities, wireless providers and data centers (peer1, squarespace, etc.) did handling outages during the recent east coast storm.  They all used Twitter, blogs, Facebook, etc. to keep us up to date and informed.

    A post-mortem has very little customer value... I simply assume you learned something and it won't happen again.  Give me information during the outage.

    Charles

  • My name is Jackie Wong, located in Hong Kong. It seems this problem also impacted our mailboxes located in the Singapore datacentre.  A user reported that their mailbox could not send mail from 7 Nov PM to 8 Nov PM.

  • As an admin to our O365 E1 environment, a few thoughts around the incidents and this posting:

    1)  The detailed explanation and transparency here as to what happened is appreciated.

    2)  The Service Health page is typically the first place I look when I hear of a potential platform issue, not so much for the actions being taken to resolve, but to correlate any possible maintenance activity around the timeframe in question.

    3)  From an administrative standpoint, our company would benefit more from timely, accurate data on the Service Health pages as opposed to messages broadcast on other channels (Twitter, RSS, etc).

    4)  Twitter as an abbreviated emergency broadcast channel could be improved by a dedicated Service Health account (to distinguish it from advertising as another poster points out), and/or hashtags to match the ticket number (#SP2598 for example) so we could correlate a message stream directly with the incident in question.
