RESOLVED 20180904 eLynx Application Outage due to Microsoft Azure Outage

  • Samantha McPheter

    UPDATE as of September 4, 2018 3:28pm Central Time:

    Microsoft just told us the following:

    Update on the South Central Azure outage: Azure AD and VSTS tenants should be coming back online now and in the near future. Some networking has been restored, but the data center itself and all of the services it hosts are still down. There is still no ETA, but they are still actively working. This is affecting hundreds of Azure customers and carries the highest possible priority to be resolved.

  • Samantha McPheter

    UPDATE as of September 4, 2018 5:00pm Central Time:

    Microsoft's latest update still provides no ETA for South Central service coming back online, so the eLynx team is preparing to execute our disaster recovery response by failing over to the North Central Azure facility. We anticipate making a 'go/no-go' call at 7:00pm Central Time, depending on whether the South Central facility is coming back online.

  • Samantha McPheter

    UPDATE as of September 4, 2018 7:00pm Central Time:

    While Microsoft's latest update still provides no ETA for the entire South Central service coming back online, services are coming back slowly. At this time, the Classic Cloud Services in Azure South Central and Azure North Central are down, which is preventing us from failing over to North Central.

    We are starting to receive current data again, indicating the system is coming back online.  Alarms are not yet being processed.  The Alarm Engine requires that the Classic Cloud Services be functioning.

    We will provide another update by 9:00pm Central Time.

  • Samantha McPheter

    UPDATE as of September 4, 2018 9:00pm Central Time:

    Azure South Central Services are continuing to come back online.

    All polling engines are running and current data is being stored for processing once the entire system is back online.

    Alarms are not yet being processed. We anticipate alarms starting back up within the next couple of hours as Microsoft continues to make progress bringing Azure South Central back online.

    We will provide another update by 11:00pm Central Time.

  • Samantha McPheter

    UPDATE as of September 4, 2018 11:00pm Central Time:

    Azure South Central Services are continuing to come back online.

    Users can now log into the application.

    Current data is not yet posting to the site and alarms are not yet being processed. 

    The eLynx Team is currently examining the Azure services related to current data and alarm processing to see why those services are not yet working.

    We will provide another update by 1:00am Central Time.

  • James Campbell

    UPDATE as of September 4, 2018 11:20pm Central Time:

    Azure services are continuing to come back online.

    The application website is back online and Demand Scans are functioning, but a couple of critical services are still unavailable. We are still polling, and that data is being stored for processing once the rest of the system is back online.

    We will provide another update by 1:00am Central Time.

  • James Campbell

    UPDATE as of September 5, 2018 12:50am Central Time:

    Azure services are still seeing performance degradation, but all critical services for SCADALynx are back online. We are processing data at this time and expect to be caught up in the next 10 to 15 minutes.

    We expect performance issues to continue throughout the night while Microsoft continues to bring their systems back to fully operational status.

    We will provide another update shortly after 8:00am Central Time.

  • Samantha McPheter

    UPDATE as of September 5, 2018 1:30am Central Time:

    All services are back online.

    Users can now log into the application.

    Current data is posting to the site and alarms are being processed.

  • Mike Greenly

    UPDATE as of September 7, 2018 11:00am Central Time:

     

    Current Status: RESOLVED

    The recovery of Microsoft Azure services in the South Central US data center has been completed, and the eLynx application and services have been stable and operating normally for over forty-eight hours.

    Root Cause

    A severe weather event, including lightning strikes, occurred near one of Microsoft’s South Central US datacenters. This resulted in power voltage fluctuations that impacted the datacenter cooling systems. Automated procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power-down process. Microsoft will provide a formal Root Cause Analysis (RCA) in the coming week. https://azure.microsoft.com/en-us/status/

     

    Why didn’t eLynx execute the disaster recovery process?

    eLynx leverages many of the Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) components of Microsoft Azure, including Azure SQL Database, Azure Site Recovery, Azure Storage, and Cloud Services, to name a few. These services leverage the regional geographic redundancy provided in Azure for disaster recovery. Throughout this outage, eLynx worked closely with Microsoft Premier Support in multiple attempts to complete the execution of eLynx's disaster recovery process to the North Central US data center.

    Due to the nature of Microsoft's South Central shutdown, key Microsoft services like Azure Active Directory and Visual Studio Team Services were impacted across the country, which inhibited the execution of the entire disaster recovery plan. This culminated in an overload of requests to Microsoft and prevented us from controlling Cloud Services in North Central, a requirement for recovering the frontend and backend of the eLynx application. As services recovered throughout the day, it became apparent that South Central would be operational before any further recovery attempt in North Central could succeed.

    What are we doing about it?

    The eLynx Team continues to work diligently with Microsoft to remediate the deficiencies discovered during this incident. eLynx strives for excellence and will make any and all changes necessary to improve resiliency and availability for our customers. eLynx is applying the lessons learned from this incident to improve our internal Disaster Recovery procedures and reduce the risk of a similar event in the future.
