Premise

As an agency, we run a number of Sitecore solutions in the cloud, most of them on Windows Azure web roles. Thankfully, Sitecore's Azure module makes packaging a solution and deploying it to web roles as simple as possible. Among other things, this module does the following to your solution:

  • Performs transformations to your web.config (and included) configuration files using XSLT;
  • Manages your deployment lifecycle (create, stop, upgrade), in both staging and production nodes;
  • Deploys your Sitecore databases to SQL Azure SaaS for you;
  • Keeps all your Azure configuration (e.g. service definition, service configuration) as Sitecore items;
  • Packages your website files into a cspkg image file for deployment.

Problem 1

One of our clients asked us for some new development work. The client's solution was built on Sitecore 7.0 and uses Sitecore Azure v.3.1. Once development was complete, we tested it fully in our staging environment, and after all tests passed we deployed the solution to the Azure web roles. After the deployment, every single page responded with a 500 error code, without even a custom error page. We had also set up Elmah on the solution to send an email on every exception, but no emails arrived either.

This rang the first alarm bell. It had to be a problem with web.config, so the next logical step was to RDP to the running web role and make a request locally to see the error. Sure enough, the error was "Cannot read configuration file because it exceeds the maximum file size". Since the configuration could not be loaded, it follows that no custom error page could be shown and that Elmah was never loaded at all.

Searching around the web, I learned that by default the maximum file size for web.config is 250 KB. On the web role, the file was 286 KB, which obviously exceeds the allowed maximum. But why did the project run fine on our staging server?

Well, it turned out that on the staging server the web.config file was 235 KB. The reason for the difference seems to be that Sitecore's Azure module first merges web.config with all the include files (App_Config/Include), giving the result you see in /sitecore/admin/showconfig.aspx. It then applies the web.config XSLT transforms mentioned in the previous section, and finally splits the result into multiple configuration files again. The split, however, uses a completely different layout from the one you may have chosen during development: there are separate include files for settings, pipelines, processors, etc. I guess the reasoning behind the merge/split cycle is that the XSLT transforms need the complete configuration before they can be applied.
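Since the deployed web.config can end up larger than the one in your source tree, a quick size check on the generated file before deploying can catch this kind of problem early. The following is a minimal Python sketch and not part of the Sitecore Azure module. The function names are made up for illustration, the 250 KB default is simply the figure quoted above, and the sketch assumes 1 KB = 1024 bytes and a strict "greater than" comparison, which may not match exactly how IIS measures the file.

```python
import os

# Default limit quoted in this post (assumption: 1 KB = 1024 bytes).
DEFAULT_LIMIT_KB = 250


def config_size_kb(path):
    """Return the size of the file at `path` in kilobytes."""
    return os.path.getsize(path) / 1024


def exceeds_limit(path, limit_kb=DEFAULT_LIMIT_KB):
    """Return True if the file at `path` is larger than `limit_kb` kilobytes."""
    return config_size_kb(path) > limit_kb
```

For the files in this story, exceeds_limit would return True for the 286 KB web.config on the web role and False for the 235 KB one on the staging server.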

Well, to implement our client's latest requests, I had to define a new custom Lucene index, along with its associated configuration. As you may have seen, index configurations can be quite verbose; in this case, it amounted to ~2000 lines. Index configurations are not given their own separate include file, which accounts for the growth of the final web.config. Saving the /configuration/sitecore node in its own file and setting configSource on it did not work, as the module's merging code copied the file contents back into the main web.config. So there was only one option left: changing a registry setting, as described in this post.

Problem 2

The issue with writing to an Azure web role's registry is how to make it stick. At any time, a web role instance may be "recycled", and changes made by hand to the VM, registry settings included, are not guaranteed to survive. Fortunately, as explained on MSDN, you can instruct the platform to run a "startup" command-line script every time the role starts.

So it has to be dead easy, right? Just create a file, e.g. SetConfigSize.cmd, and save it in /App_Data/AzureOverrideFiles/bin, with this content:

reg add HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\InetStp\Configuration /v MaxWebConfigFileSizeInKB /t REG_DWORD /d 1024
reg add HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\InetStp\Configuration /v MaxWebConfigFileSizeInKB /t REG_DWORD /d 1024

Then go to Sitecore Azure's configuration in the content tree, and in the "Service Definition" field of the staging and production nodes, add the XML required to run it:

<Startup>
  <Task commandLine="SetConfigSize.cmd" executionContext="elevated" taskType="simple" />
</Startup>

Finally, open Sitecore's Azure module, choose your target node (staging/production) and deployment, and perform "upgrade files", which is the Azure module's lingo for packaging the solution in a cspkg image and deploying it to Azure.

Problem 3

Well, we didn't get off that easy. After we chose to upgrade files, the deployment never started. A visit to the Azure management portal (the legacy portal) showed that the instance was busy executing startup tasks, and it stayed stuck there. To make matters worse, there's no obvious way to cancel the deployment and try again.

First of all, a closer look at the above MSDN article reveals that "Startup tasks must end with an errorlevel (or exit code) of zero for the startup process to complete. If a startup task ends with a non-zero errorlevel, the role will not start". Evidently, for some reason, "reg add" did not return errorlevel 0. Additionally, to organize things a little better, I decided to move the registry settings into their own ConfigSize.reg file:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\InetStp\Configuration]
"MaxWebConfigFileSizeInKB"=dword:400

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\InetStp\Configuration]
"MaxWebConfigFileSizeInKB"=dword:400

and then call them from an updated SetConfigSize.cmd:

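REM Import the registry values silently (no confirmation dialog)
regedit.exe /s ConfigSize.reg
REM Restart IIS so that the new limit takes effect
iisreset
REM W3SVC is set to manual startup on web roles, so start it explicitly
net start w3svc
REM Always report success so that the role is allowed to start
exit /B 0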

The part about iisreset was not in the original .cmd file and was added later in my investigation. To make a long story short, simply setting the registry value is not enough: you need to restart IIS. And since the World Wide Web publishing service is set to manual startup on Azure web roles, you need to start that too. The last bit is important: no matter the outcome of regedit, we choose to return an errorlevel of 0 so that the role actually does start. I had to find that out the hard way (i.e. three "upgrade files" actions later).
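One detail that is easy to misread: DWORD values in a .reg file are written in hexadecimal, so dword:400 in ConfigSize.reg is 1024 in decimal, the same value that was passed as /d 1024 to "reg add" earlier. Below is a small Python sketch of the conversion; the helper name reg_dword_to_int is made up for illustration and is not part of any Windows tooling.

```python
def reg_dword_to_int(value):
    """Convert a .reg DWORD such as 'dword:400' to an int (the digits are hexadecimal)."""
    prefix = "dword:"
    if not value.lower().startswith(prefix):
        raise ValueError("not a dword value: " + value)
    return int(value[len(prefix):], 16)


print(reg_dword_to_int("dword:400"))       # prints 1024
print(reg_dword_to_int("dword:00000400"))  # prints 1024 (zero-padded form)
```

Had the file said dword:1024 instead, the limit would have been set to 4132 KB (0x1024) rather than the intended 1024 KB.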

The second issue is how to unblock a startup process that is already stuck in the "busy" state. Well, the answer is in Kevin Williamson's excellent blog post series on Windows Azure troubleshooting. In short, you have to RDP to the web role (the web role VM is actually running), copy the corrected startup scripts over, and then use the Task Manager to kill the WaHostBootstrapper process, which is responsible for running the role's startup tasks. A few minutes later, the role will start fine.

And voilà. After the role starts, no more crashes, and the website works perfectly.

Afterthought

It would make sense for the Sitecore Azure module's merging/splitting code to save the entire /configuration/sitecore node into its own config file and load it into the main web.config using a configSource attribute. Unfortunately, at least in version 3.1 of the Azure module, this is not the case. We have not worked with the module's later versions, so it may have been fixed in one way or another. If not, a workaround such as the one described in this post may be your only option.