Welcome to Shaun Luttin's public notebook. It contains rough, practical notes. The guiding idea is that, despite what marketing tells us, there are no experts at anything. Sharing our half-baked ideas helps everyone. We're all just muddling thru. Find out more about our work at bigfont.ca.

Azure Web App Service and Portal downtime analysis 06 October 2017

Tags: azure, azure-web-apps

TLDR;

Eight of our Azure Web Apps went down for 23 hours. In addition, several parts of the Azure portal became non-responsive. The problem was a file server that got into a bad state and had to be rebooted.

Problem List

  • Eight of my Azure Web App service sites are down.
  • The SCM is not accessible for any of them.
  • The Web App Service "Deployment options" blade does not work.
  • The Web App Service "Diagnostics logs" blade does not work.
  • The Azure Status page says that all services are running normally.

7:31 PM PST - 6 Oct 2016

I notice that the Azure Web App Deployment options button is non-responsive for more than one of my sites and share this via Twitter. I replicate the problem in Edge, Firefox, and Chrome with private/incognito windows. I capture this console output in Chrome:

Document was loaded from Application Cache with manifest https://portal.azure.com/?feature.appcache=true&l=en.en-us&cdnIndex=4
Application Cache Checking event
Application Cache NoUpdate event

         _    _____   _ ___ ___
        /_\  |_  / | | | _ \ __|           Want to build awesome
  _ ___/ _ \__/ /| |_| |   / _|___ _ _     web applications
(___  /_/ \_\/___|\___/|_|_\___| _____)    like this one?
   (_______ _ _)         _ ______ _)_ _    Join us.
          (______________ _ )   (___ _ _)  http://aka.ms/BuildThis


5.0.302.509 (production#e66c4a9.161006-1215)
Session: 8fed15f9284c423e8d86d35209d3491f
Document was loaded from Application Cache with manifest https://portal.azure.com/AzureHubs/Hubs?feature.appcache=true&l=en.en-us&cdnIndex=4
Application Cache Checking event
Application Cache NoUpdate event
Adding master entry to Application Cache with manifest https://stamp2.app.insightsportal.visualstudio.com/InsightsExtension?feature.appcache=true&l=en.en-us&cdnIndex=4
Adding master entry to Application Cache with manifest https://web1.appsvcux.ext.azure.com/websites/WebsitesContent/WebsitesIndex?feature.appcache=true&l=en.en-us&cdnIndex=4
Application Cache NoUpdate event
Adding master entry to Application Cache with manifest https://insights1.exp.azure.com/insights/InsightsContent/InsightsIndex?feature.appcache=true&l=en.en-us&cdnIndex=4
Application Cache NoUpdate event
[fx]  8:04:07 PM MsPortalImpl/Base/Base.Selectable2 Base.Selectable2: message: Cannot read property 'supplyInitialData' of undefined
stack: TypeError: Cannot read property 'supplyInitialData' of undefined
    at i (https://portal.azure.com/Content/Dynamic/MsPortalFx_C2AE501C18F3AB18576CCB5119954C1FE01266B9.js:261:512)
    at new i (https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???=**WebsitesExtension/TypeScript/_generated/Blades/TroubleshootBlade:42:147)
    at https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???js?root=*WebsitesExtension/TypeScript/Search/WebsiteBrowseViewModel:15:860
    at h (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:453)
    at e.promiseDispatch.s (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:731)
    at t.f.promiseDispatch (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:229:931)
    at https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:228:676
    at MessagePort.t (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:233:201)

[HubsExtension]  8:04:13 PM MsPortalFx.Base.Diagnostics.ErrorReporter 1 MsPortalFx.Base.Diagnostics.ErrorReporter: message: Cannot read property 'supplyInitialData' of undefined
stack: TypeError: Cannot read property 'supplyInitialData' of undefined
    at i (https://portal.azure.com/Content/Dynamic/MsPortalFx_C2AE501C18F3AB18576CCB5119954C1FE01266B9.js:261:512)
    at new i (https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???=**WebsitesExtension/TypeScript/_generated/Blades/TroubleshootBlade:42:147)
    at https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???js?root=*WebsitesExtension/TypeScript/Search/WebsiteBrowseViewModel:15:860
    at h (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:453)
    at e.promiseDispatch.s (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:731)
    at t.f.promiseDispatch (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:229:931)
    at https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:228:676
    at MessagePort.t (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:233:201)

[WebsitesExtension]  8:04:15 PM MsPortalFx.Base.Diagnostics.ErrorReporter 1 MsPortalFx.Base.Diagnostics.ErrorReporter: message: Cannot read property 'supplyInitialData' of undefined
stack: TypeError: Cannot read property 'supplyInitialData' of undefined
    at i (https://portal.azure.com/Content/Dynamic/MsPortalFx_C2AE501C18F3AB18576CCB5119954C1FE01266B9.js:261:512)
    at new i (https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???=**WebsitesExtension/TypeScript/_generated/Blades/TroubleshootBlade:42:147)
    at https://az772487.vo.msecnd.net/websites/Content/Dynamic/AmdBundleDefinition???js?root=*WebsitesExtension/TypeScript/Search/WebsiteBrowseViewModel:15:860
    at h (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:453)
    at e.promiseDispatch.s (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:235:731)
    at t.f.promiseDispatch (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:229:931)
    at https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:228:676
    at MessagePort.t (https://portal.azure.com/Content/Dynamic/MsPortalFxStable_AD3A97F8E4DFEDF813AEDF50EB2617D3D2774D04.js:233:201)

GET https://management.azure.com/subscriptions/XXXXX-XXXX-XXXX-XXXX-XXXXX???erverfarms/DefaultServerFarm/usages?api-version=2015-08-01&_=1475809435244 408 (Request Timeout)
POST https://web1.appsvcux.ext.azure.com/websites/api/Websites/GetScmInfo 408 (Request Timeout)
POST https://web1.appsvcux.ext.azure.com/websites/api/Websites/GetApplicationDiagnosticsSettings 408 (Request Timeout)
Application Cache Checking event
Application Cache NoUpdate event
Application Cache Checking event
Application Cache NoUpdate event
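The console output above is mostly noise; the useful lines are the three requests that came back with 408. A minimal sketch for pulling such lines out of a saved console log follows. It assumes the log uses the same "METHOD url status (reason)" layout as the capture above, and the function name `failed_requests` is made up for illustration.

```python
import re

# Matches request lines shaped like the ones in the capture above, e.g.
# "POST https://... 408 (Request Timeout)". This layout is an assumption
# based on that capture, not a documented format.
REQUEST_LINE = re.compile(r"^(GET|POST)\s+(\S+)\s+(\d{3})\s+\((.+)\)$")


def failed_requests(console_text):
    """Return (method, url, status, reason) for each matching console line."""
    results = []
    for line in console_text.splitlines():
        match = REQUEST_LINE.match(line.strip())
        if match:
            method, url, status, reason = match.groups()
            results.append((method, url, int(status), reason))
    return results


# Two request lines and one unrelated line copied from the capture above.
SAMPLE = """\
Application Cache NoUpdate event
POST https://web1.appsvcux.ext.azure.com/websites/api/Websites/GetScmInfo 408 (Request Timeout)
POST https://web1.appsvcux.ext.azure.com/websites/api/Websites/GetApplicationDiagnosticsSettings 408 (Request Timeout)
"""

for method, url, status, reason in failed_requests(SAMPLE):
    print(status, reason, method, url)
```

Both failing endpoints here (GetScmInfo and GetApplicationDiagnosticsSettings) line up with the two portal blades that would not load.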

8:21 PM PST - 6 Oct 2016

I notice that the following eight sites are down in my Azure App Services despite my not having done anything to any of these sites in several weeks.

  1. de-en.azurewebsites.net
  2. gardenfaire2.azurewebsites.net
  3. lrj-global.azurewebsites.net
  4. orchard-theme-machine-designer.azurewebsites.net
  5. singular-biogenics.azurewebsites.net
  6. tsokh.azurewebsites.net
  7. ssiproud.azurewebsites.net
  8. zolob.azurewebsites.net

In contrast, bigfont1.azurewebsites.net is up, and shares both the same Azure subscription and the following properties:

  • Resource group: Default-Web-WestUS
  • App Service plan/pricing tier: DefaultServerFarm
  • Location: West US
  • Last change by me: Weeks or months ago
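Checking nine sites by hand in a browser gets old quickly. Here is a rough sketch of how the same check could be scripted with the Python standard library. The `probe` function and its `opener` parameter are my own illustration (the parameter exists so the function can be tried without touching the network); none of this is an Azure tool.

```python
import urllib.error
import urllib.request

# The eight hosts listed above as down, plus the one that stayed up.
HOSTS = [
    "de-en.azurewebsites.net",
    "gardenfaire2.azurewebsites.net",
    "lrj-global.azurewebsites.net",
    "orchard-theme-machine-designer.azurewebsites.net",
    "singular-biogenics.azurewebsites.net",
    "tsokh.azurewebsites.net",
    "ssiproud.azurewebsites.net",
    "zolob.azurewebsites.net",
    "bigfont1.azurewebsites.net",
]


def probe(host, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for https://<host>/, or an error string."""
    url = f"https://{host}/"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except OSError as err:
        # DNS failures, refused connections, and timeouts all end up here.
        return f"unreachable: {err}"
```

Running `for host in HOSTS: print(host, probe(host))` prints one line per site. A site that answers with an error page shows up as a number such as 404, while a site that does not answer at all shows up as "unreachable".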

8:30 PM PST - 6 Oct 2016

I raise the issue on social.msdn.microsoft.com. Support suggests clearing the cache, trying other browsers, and troubleshooting Kudu deployments.

8:30 am PST - 07 Oct 2016

An MSFT engineer tells me this:

the API back-end was having some issues last night which is why you were seeing time outs in the UI.

NOTE: This is the very first piece of useful information that I have received from MSFT. It is 13 hours since I asked about the situation.
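The elapsed-time figures in these notes come from subtracting timeline headings. A small sketch of that arithmetic, assuming the clock starts at the first entry (7:31 PM on 6 Oct) and that every heading follows the same "time PST - day month year" layout:

```python
from datetime import datetime

# Timeline headings in these notes look like "7:31 PM PST - 6 Oct 2016".
HEADING_FORMAT = "%I:%M %p PST - %d %b %Y"


def hours_between(start_heading, end_heading):
    """Return the number of hours between two timeline headings."""
    start = datetime.strptime(start_heading, HEADING_FORMAT)
    end = datetime.strptime(end_heading, HEADING_FORMAT)
    return (end - start).total_seconds() / 3600


elapsed = hours_between("7:31 PM PST - 6 Oct 2016", "8:30 am PST - 07 Oct 2016")
print(f"{elapsed:.1f} hours")  # prints 13.0 hours
```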

8:30 am PST - 07 Oct 2016

I open support ticket 116100714772026 from the Azure Portal. The ticket asks about two problems:

  1. The Deployment options button is broken.
  2. Eight of my sites are down.

10:30 am PST - 07 Oct 2016

MSFT Support calls me, gathers information about problem 1, and tells me to open another ticket for problem 2.

NOTE: Technical support at MSFT can handle only one problem per support ticket. If you have two questions, then open two support tickets; otherwise, you will have to wait an additional 2 hours to receive a response about your second question.

10:45 am PST - 07 Oct 2016

To address problem 2, I open support ticket 116100714772641 from the Azure Portal.

NOTE: Eight sites are still down sixteen hours after raising the concern.

11:31 am PST - 07 Oct 2016

Direct navigation to the de-en.azurewebsites.net site gives an HTTP 404. Investigation in the Azure portal shows these results:

  1. Overview > Monitoring: 0 HTTP Server Errors.
  2. Activity log > Timespan Last 2 weeks: No results to display.
  3. Diagnose and solve problems > Resource Health: Available. NOTE: It takes about 75 seconds for the blade to display itself.
  4. Diagnostic logs: The blade opens but is non-responsive. It instead displays "WEBSITELOGSPART."

12:15 pm PST - 07 Oct 2016

MSFT technical support contacts me about the second ticket. The diagnosis is that DefaultServerFarm has run out of memory because it hosts 29 Web Apps and 7 slots. (This turns out to be wrong.)

In other words, support thinks we are running out of RAM. (I can see this in the Metrics per Instance for the App Service Plan.) In the short term, support says I can scale up, e.g., from the S1 to the S2 App Service Plan.

I try this and receive the following error:

Failed to update App Service plan DefaultServerFarm: Heuristics indicate WebApiClient request timed out. Uri: https://management.azure.com/subscriptions/XXXX-XXXX-XXXX-XXXX-XXXXX/resourcegroups/Default-Web-WestUS/providers/Microsoft.Web/serverfarms/DefaultServerFarm?api-version=2014-06-01 Timeout: 00:01:00

I try a second time and receive the following error:

Failed to update App Service plan DefaultServerFarm: {"Code":"Conflict","Message":"Cannot modify this web hosting plan because another operation is in progress. Conflicting operation details: Id: XXXX-XXXX-XXXX-XXXX-XXXX, OperationName: UpdateServerFarm, CreatedTime: 10/7/2016 7:29:17 PM, WebSystemName: websites, SubscriptionName: XXXX-XXXX-XXXX-XXXX-XXXX, WebspaceName: westuswebspace, SiteName: , SlotName: , ServerFarmName: DefaultServerFarm","Target":null,"Details":[{"Message":"Cannot modify this web hosting plan because another operation is in progress. Conflicting operation details: Id: XXXX-XXXX-XXXX-XXXX-XXXX, OperationName: UpdateServerFarm, CreatedTime: 10/7/2016 7:29:17 PM, WebSystemName: websites, SubscriptionName: XXXX-XXXX-XXXX-XXXX-XXXX, WebspaceName: westuswebspace, SiteName: , SlotName: , ServerFarmName: DefaultServerFarm"},{"Code":"Conflict"},{"ErrorEntity":{"Code":"Conflict","Message":"Cannot modify this web hosting plan because another operation is in progress. Conflicting operation details: Id: XXXX-XXXX-XXXX-XXXX-XXXX, OperationName: UpdateServerFarm, CreatedTime: 10/7/2016 7:29:17 PM, WebSystemName: websites, SubscriptionName: XXXX-XXXX-XXXX-XXXX-XXXX, WebspaceName: westuswebspace, SiteName: , SlotName: , ServerFarmName: DefaultServerFarm","ExtendedCode":"11008","MessageTemplate":"Cannot modify this web hosting plan because another operation is in progress. Conflicting operation details: {0}","Parameters":["Id: XXXX-XXXX-XXXX-XXXX-XXXX, OperationName: UpdateServerFarm, CreatedTime: 10/7/2016 7:29:17 PM, WebSystemName: websites, SubscriptionName: XXXX-XXXX-XXXX-XXXX-XXXX, WebspaceName: westuswebspace, SiteName: , SlotName: , ServerFarmName: DefaultServerFarm"],"InnerErrors":null}}],"Innererror":null}
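That error is hard to read because the same message is repeated at several levels of the JSON. A minimal sketch for pulling out the parts that matter follows. `SAMPLE_ERROR` is a shortened copy of the error above (only the top-level Code and Message), and `parse_portal_error` is a name I made up; the "Key: value" layout of the details is assumed from this one example.

```python
import json

# Shortened copy of the second error: the portal's prose prefix followed by
# the top-level "Code" and "Message" fields of the JSON payload.
SAMPLE_ERROR = (
    'Failed to update App Service plan DefaultServerFarm: '
    '{"Code":"Conflict","Message":"Cannot modify this web hosting plan '
    'because another operation is in progress. Conflicting operation details: '
    'Id: XXXX-XXXX-XXXX-XXXX-XXXX, OperationName: UpdateServerFarm, '
    'CreatedTime: 10/7/2016 7:29:17 PM, WebSystemName: websites, '
    'SubscriptionName: XXXX-XXXX-XXXX-XXXX-XXXX, WebspaceName: westuswebspace, '
    'SiteName: , SlotName: , ServerFarmName: DefaultServerFarm"}'
)


def parse_portal_error(text):
    """Split a portal error into (code, message, conflict_details)."""
    # The JSON payload starts at the first "{"; everything before it is prose.
    # Errors with no JSON at all (like the earlier timeout) raise ValueError.
    payload = json.loads(text[text.index("{"):])
    message = payload.get("Message", "")
    # The conflicting operation is described as "Key: value" pairs joined by ", ".
    _, _, raw_details = message.partition("Conflicting operation details: ")
    details = {}
    for pair in raw_details.split(", "):
        key, separator, value = pair.partition(": ")
        if separator:
            details[key] = value
    return payload.get("Code"), message, details


code, _, details = parse_portal_error(SAMPLE_ERROR)
print(code, details["OperationName"], details["CreatedTime"])
```

Read this way, the error says that an UpdateServerFarm operation created at 7:29:17 PM on 10/7/2016 is still holding the plan.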

Since those errors are happening, support suggests waiting an hour and then trying a third time.
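"Wait and try again" is easy to script. A rough sketch of the pattern follows, where `operation` is a hypothetical callable standing in for whatever performs the scale-up (I did it by hand in the portal, so nothing here is a real Azure call) and `sleep` is injectable so the function can be tried without actually waiting.

```python
import time


def retry_until_done(operation, attempts=3, wait_seconds=3600, sleep=time.sleep):
    """Call operation() until it succeeds, waiting between failed attempts."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:  # a real script should catch narrower errors
            last_error = err
            if attempt < attempts:
                # Give the conflicting operation time to finish before retrying.
                sleep(wait_seconds)
    # Every attempt failed, so surface the most recent error to the caller.
    raise last_error
```

With the defaults this mirrors what support suggested: up to three tries, an hour apart.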

After we wait an hour, support confirms that the scale-up to more RAM succeeded. Unfortunately, this does not resolve the problem. Despite having a lot more RAM, we still have the original problems on all eight sites.

4:09 PM PST - 07 Oct 2016

I tweet to @davidebbo to ask if he knows what's happening. He commits to investigating.

4:55 PM PST - 7 Oct 2016

The original problems remain unresolved. Professional support tells me that there is nothing more they can do for now.

5:00 PM PST - 07 Oct 2016

@davidebbo replies that it's fixed, and indeed it is.

My Concerns

  • Will MSFT billing compensate me for the 24 hours of downtime and personal troubleshooting work? Will I need to ask for this compensation or will it be forthcoming?
  • Will MSFT billing charge me for the two professional support tickets that are related to this problem?
  • Will MSFT billing charge me for scaling up the site to "fix the RAM problem"? Will I need to ask them not to charge me or will someone handle this for me?
  • Will MSFT fix the Azure Status page, so that it shows problems when problems exist?