Overview
Earlier this week the eGIS team noticed that backups were not being performed on our Pilot instance. The end result was that the Server installation was corrupt and a restore was required.
Cause
Jon, Josh, and Jiehua, as well as an ESRI employee, inspected the logs, and the most likely scenario is as follows: the C drive on the server machine ran out of space. This meant that AGS could not write its state as required, and as a result the server could not start after a reboot.
Remediation
The OS disk on the server machine was resized immediately. This alone was not enough to bring the server back to a running state.
ARM Template Updates
All OS disks and the DS data disk were resized via ARM template deployment. This required the ARM templates to be refactored to list the disks as their own resources, as opposed to the short form within the VM resource.
ESRI PowerShell DSC
The PLT template was modified to point to the file share on DS0. Despite running as cloud\arcgisservice, which should have permissions, the script failed to create a site, reporting that the config folder was not writable. After a few attempts this approach was abandoned, and the folder c:\ags\ was used as the root for the config-store and directories.
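A "config folder was not writable" failure like the one above can be checked independently of the DSC script. The following is a minimal sketch, not part of the ESRI tooling: Test-FolderWritable is a hypothetical helper name, and it only reports on the account that runs it, so it would need to be run as the service account to be meaningful.

```powershell
# Hypothetical helper (not part of the ArcGIS DSC module): returns $true if the
# current account can create and delete a file in $Path, $false otherwise.
function Test-FolderWritable {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    # Random file name so the probe never collides with real content.
    $probe = Join-Path -Path $Path -ChildPath ("write-probe-{0}.tmp" -f [guid]::NewGuid())

    try {
        New-Item -Path $probe -ItemType File -ErrorAction Stop | Out-Null
        Remove-Item -Path $probe -ErrorAction Stop
        return $true
    }
    catch {
        Write-Warning "Cannot write to '$Path': $($_.Exception.Message)"
        return $false
    }
}
```

Calling Test-FolderWritable with the UNC path of the share on DS0 (the exact path is not recorded here) from a session running as cloud\arcgisservice would show whether the problem is share or NTFS permissions rather than the DSC configuration.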
Uninstalling via DSC
The server software was uninstalled via DSC as follows (see repository for template):
Configure-ArcGIS `
    -ConfigurationParametersFile 'D:\DSC\PowerShell DSC\PLT-UNINSTALL_GIS-ONLY.json' `
    -Mode UnInstall -DebugSwitch
Re-installing Server
With only the server uninstalled, an attempt was made to re-install just the server using the same template. The script ran, but the resulting configuration wasn’t right.
The server was already federated and the web adaptor was in place. The URL checks passed, but the new server had new secrets for token generation, and the federation was not right.
Some attempts to unfederate and re-federate failed.
Using a fresh web adaptor pointing to /arcgis instead of /server was promising, but the portal didn’t recognize the existing data store as an ESRI data store.
Re-installing everything
tall -DebugSwitch
This uninstalled the software, but left all config and content folders in place. Remove them before uninstalling.
Configure-ArcGIS `
    -ConfigurationParametersFile 'D:\DSC\PowerShell DSC\PLT-BaseDeployment-MultiMachine_DomainController.json' `
    -Mode UnInstall
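The note above about leftover config and content folders could be handled with a small clean-up step along these lines. This is a sketch under stated assumptions: the folder list is illustrative (C:\ags is the root mentioned earlier in this write-up; any other locations must be added by hand), and -WhatIf is left on so nothing is actually deleted until it is removed.

```powershell
# Assumed locations: C:\ags is the root used earlier in this write-up.
# Add any other config-store or content folders used by the deployment.
$leftoverFolders = @('C:\ags')

foreach ($folder in $leftoverFolders) {
    if (Test-Path -Path $folder) {
        # -WhatIf only reports what would be deleted; drop it to really delete.
        Remove-Item -Path $folder -Recurse -Force -WhatIf
    }
    else {
        Write-Host "Not present: $folder"
    }
}
```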
Lessons learned
Repair
In the end, it would have been quicker to wipe out the VMs and start from scratch. The DSC scripts can reset the configuration on existing instances to what it should be according to the master branch, but replacing just the Server with a fresh install while leaving the rest in place didn’t work.
Monitoring
We need more awareness of the state of our VMs. A meeting is already scheduled to set up Azure monitoring and alerts to prevent this from sneaking up on us in the future.
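Until that monitoring is in place, a scheduled script could flag low disk space, which was the likely root cause here. The following is a minimal sketch, not a replacement for Azure alerts: Select-LowSpaceDisk is a hypothetical helper name and the 15% threshold is an assumption to be tuned per machine.

```powershell
# Hypothetical helper: passes through only disks whose free space is below
# the given percentage. Expects objects with DeviceID, FreeSpace and Size.
function Select-LowSpaceDisk {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        $Disk,
        [double]$MinFreePercent = 15   # assumed threshold
    )
    process {
        $freePercent = [math]::Round(100 * $Disk.FreeSpace / $Disk.Size, 1)
        if ($freePercent -lt $MinFreePercent) {
            [pscustomobject]@{
                Drive       = $Disk.DeviceID
                FreePercent = $freePercent
            }
        }
    }
}

# DriveType 3 = local fixed disk (Windows only).
Get-CimInstance -ClassName Win32_LogicalDisk -Filter 'DriveType = 3' |
    Select-LowSpaceDisk -MinFreePercent 15 |
    ForEach-Object { Write-Warning ("{0} has only {1}% free" -f $_.Drive, $_.FreePercent) }
```

Keeping the threshold logic in a function that takes plain objects means it can be exercised without touching a real disk, and the warning line can later be swapped for whatever notification channel the team settles on.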
Staying up-to-date
The restore process required running the script versions that were current when Pilot was installed. Some steps that those versions required are redundant in the current deployment. We should move forward with this upgrade.