Restore 2019-10-18

Overview

Earlier this week the eGIS team noticed that backups were not being performed on our Pilot instance. In the end, Server turned out to be corrupt and a restore was required.

Cause

Jon, Josh, and Jiehua, as well as an ESRI employee, inspected the logs. The most likely scenario is as follows: the C drive on the server machine ran out of space, so AGS (ArcGIS Server) could not write its state as required. The result was a corrupted server that could not start after a reboot.

Remediation

The OS disk on the server machine was resized immediately. Resizing alone was not enough to bring the server back to a running state.

ARM Template Updates

All OS and data disks were resized via ARM template deployment. This required refactoring the ARM templates to declare each disk as its own resource, rather than using the short form within the VM resource.

ESRI PowerShell DSC

The PLT template was modified to point to the file share on DS0. Despite running as cloud\arcgisservice, which should have permissions, the script failed to create a site, reporting that the config folder was not writable. After a few attempts this approach was abandoned, and the folder c:\ags\ was used as the root for config-store and directories.

Uninstalling via DSC

The server software was uninstalled via DSC as follows (see repository for template):

Configure-ArcGIS `
    -ConfigurationParametersFile 'D:\DSC\PowerShell DSC\PLT-UNINSTALL_GIS-ONLY.json' `
    -Mode UnInstall -DebugSwitch

Re-installing Server

With only Server uninstalled, an attempt was made to re-install just Server using the same template. The script ran, but the resulting configuration was not right:

  • The server was already federated and the web adaptor was in place. The URL check passed, but the new server had new secrets for token generation and the federation was not right.

    • Some attempts to unfederate and re-federate failed.

    • Using a fresh web adaptor pointing to /arcgis instead of /server was promising, but the portal did not recognize the existing data store as an ESRI data store.

Re-installing everything

The command below uninstalled the software, but left all config and content folders in place. Remove those folders before uninstalling.

Configure-ArcGIS `
    -ConfigurationParametersFile 'D:\DSC\PowerShell DSC\PLT-BaseDeployment-MultiMachine_DomainController.json' `
    -Mode UnInstall
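
The folder cleanup mentioned above could be scripted so that it is repeatable on every machine. The following is a minimal sketch, not what was actually run during the restore: the function name remove_leftover_folders and its dry_run parameter are hypothetical, and the list of folders has to be supplied by the operator because the exact config and content paths are not recorded here. It defaults to a dry run so that nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def remove_leftover_folders(folders, dry_run=True):
    """Delete each folder in `folders` that exists; return the ones found.

    Hypothetical helper. With dry_run=True (the default) nothing is
    deleted, so the returned list can be reviewed first.
    """
    found = []
    for folder in map(Path, folders):
        if folder.is_dir():
            found.append(folder)
            if not dry_run:
                # Removes the folder and everything beneath it.
                shutil.rmtree(folder)
    return found
```

Calling it once with the default dry run, checking the returned list, and then calling it again with dry_run=False keeps the destructive step explicit.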

 

Lessons learned

Repair

In the end, it would have been quicker to wipe out the VMs and start from scratch. The DSC scripts can reset the configuration on an existing instance to what it should be according to the master branch, but replacing just Server with a fresh install while leaving the rest in place did not work.

Monitoring

We need more awareness of the state of our VMs. We have collaborated with the cloud team to configure alerting for when the default VM health metrics are violated.
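
Since this incident started with a full C drive, free disk space is one condition worth watching. The sketch below only illustrates that kind of check and is not the alerting the cloud team configured. The function names and the 10% threshold are assumptions chosen for the example.

```python
import shutil


def is_low_on_space(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True when the free share of a volume is below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (free_bytes / total_bytes) < min_free_fraction


def check_volume(path, min_free_fraction=0.10):
    """Inspect the volume that holds `path`; return (free_bytes, low)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    low = is_low_on_space(usage.total, usage.free, min_free_fraction)
    return usage.free, low
```

On the server machine this would be called as check_volume("C:\\") from a scheduled job, with the result forwarded to whichever alerting channel is in use.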

Staying up-to-date

The restore process required running the versions of the scripts that were current when Pilot was installed. Some of the steps those versions required are redundant in the current deployment. We should move forward with this upgrade.