High availability in practice

Overview

Resilience testing of our STG environment led to bug #3743, among others, and motivated a review of our HA implementation in practice. What follows are the notes from that testing. These notes should provide context for upcoming pull requests and will be referenced when the build book (RDIMS 14981366) is updated.

Notes

ARM Deploy

  • NEW internal web load balancer

  • ./addUsersToLocalGroups.ps1 now accepts internal web LB IP as parameter

    • Need to look this up in the Azure portal

    • Required to balance twinned web VMs from HOSTS or domain DNS

  • Confirm the tcgis.ca CNAME points to the load balancer, not web0

    • set jshweb0 to jshweb2,

    • set jshself to jshweb0 to ensure this

DSC Deploy

  • Use new parameter to addUsersToLocalGroups.ps1

.\addUsersToLocalGroups.ps1 -egis_env stg -web_internal_lb_ip "xxx.xxx.xxx.xxx"

  • Check that all VM hosts files are updated to point to new internal LB

    • Single VM deployments can point to web0 still

    • done by addUsersToLocalGroups.ps1, but should confirm

    • WEB* VMs set hosts to themselves

  • https://jshweb.tcgis.ca/portal/portaladmin/system/properties

    • Update privatePortalURL to use the web adaptor via the public URL

    • Jon has a task to add this to Site Config function app

{ "WebContextURL":"https://jshweb.tcgis.ca/portal", "privatePortalURL":"https://jshweb.tcgis.ca/portal" }
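The property values above can be generated rather than typed by hand. This is a minimal sketch, assuming the public web adaptor URL is the only input. The function name `build_portal_url_properties` is hypothetical, and submitting the JSON (through the admin page or the Site Config function app) is deliberately left out.

```python
import json


def build_portal_url_properties(public_portal_url: str) -> str:
    """Return system-properties JSON with both URLs set to the public web adaptor URL."""
    url = public_portal_url.rstrip("/")
    if not url.startswith("https://"):
        raise ValueError("expected an https:// URL")
    # Both properties point at the same public URL, as in the notes above.
    return json.dumps({"WebContextURL": url, "privatePortalURL": url})


print(build_portal_url_properties("https://jshweb.tcgis.ca/portal"))
```

Setting both keys from a single input avoids the two URLs drifting apart between environments.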

VM failure testing

 

Test 0: turn off WEB0

  • Check that federated server was valid in portal org settings: pass

  • Check that server relational datastore was valid: pass

  • Reload existing hosted feature service: pass

  • Delete and recreate hosted feature service: pass

Test 0.5: Turn on WEB0, turn off WEB1

  • Done at same time, so some temp downtime

    • Waited until NLB showed full availability

  • No noticeable difference to end user

  • Map existing hosted feature layer: pass

Test 1: Turn off POR1/Standby

  • Turned off "Portal for ArcGIS" service, not VM

  • Started 2:14

  • Check that federated server was valid in portal org settings: pass

  • Check that server relational datastore was valid: pass

  • Mapped existing layer

    • Note: privatePortalURL had not yet been repointed at this stage; updated it now and repeated the checks below

    • Changing it while POR1was down was fine; the setting was in place when the service restarted

  • Check that federated server was valid in portal org settings: pass

  • Check that server relational datastore was valid: pass

  • Mapped existing layer: pass

Test 1a: Turn on POR1/Standby

  • Reconnect to portal site

    • time: A few minutes

  • Mapped existing layer: pass

  • Seems a bit slow, no numbers to support

Test 2: Turn off POR0/Primary, service only

  • Service turned off at 8:16

  • EGIS responsive at 8:18

  • Check that federated server was valid in portal org settings: pass

  • Check that server relational datastore was valid: pass

  • Mapped existing layer: pass

  • Delete and re-create hosted feature service: pass

Test 2a: Turn ON POR0 (now standby), service only

Test 3: Turn off POR0 (standby) VM

Test 3a: Turn on POR0 (standby) VM

  • Turned on 9:03

    • VM started 9:04

  • 9:06 site status ready

Test 4: Turn off POR1 (primary) VM

  • Turned off 9:10

  • 9:11 POR0 now primary (about 90s)

    • POR1 status not ready

    • Site available

  • Check that federated server was valid in portal org settings: pass

  • Check that server relational datastore was valid: pass

  • Mapped existing layer: pass

  • Delete and re-create hosted feature service: pass

Test 4a: Turn on POR1 (now standby) VM

  • Turned on 9:17

    • VM started 9:18 (~60s)

  • Site available throughout

  • POR1 status ready 9:20 (~90s)

  • Mapped existing layer: pass

Test 5: Turn off GIS0 VM

  • Turned off 10:41

    • stopped 10:42

  • Listed as stopped in server manager 10:42

  • Mapped existing layer: pass

  • Delete and re-create hosted feature service: pass

Test 5a: Turn on GIS0 VM

  • Turned on 10:51

    • Started 10:52 (~60s)

  • Status STARTED 10:54 (~2.4 minutes)
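The recovery times in these tests were worked out by hand from the logged clock times. A small helper like the one below could do the same arithmetic. It is only a sketch: `minutes_between` is a hypothetical name, and because the logged times have minute resolution the result is coarse.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times written as H:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta.total_seconds() < 0:
        raise ValueError("end is earlier than start")
    return int(delta.total_seconds() // 60)


# Test 2: service turned off at 8:16, EGIS responsive at 8:18
print(minutes_between("8:16", "8:18"))
# Test 4a: POR1 turned on at 9:17, status ready at 9:20
print(minutes_between("9:17", "9:20"))
```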

 

Data store fail-over

Jiehua has discovered the key to ensuring that the datastore will fail over correctly, as noted in https://dev.azure.com/TCOPP/EGIS/_workitems/edit/3830. The key portion being:

(This step was provided by Esri Inc. and is key to successful data store failover.) On both ds0 and ds1, manually edit the data store properties file, C:\ArcGIS\DataStore\framework\etc\datastore.properties, and set failover_on_primary_stop=true. The default setting is false. Then restart the data store services.
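The manual edit could be scripted so that ds0 and ds1 end up with the same value. The sketch below only transforms the file text. It assumes the file uses plain key=value lines, uses the hypothetical helper name `set_property`, and does not write the file or restart any service.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Set key=value in properties-style text, replacing an existing line or appending one."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(("#", "!")):
            continue  # leave comment lines untouched
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample content; the real datastore.properties holds other settings too.
sample = "# sample\nfailover_on_primary_stop=false\n"
print(set_property(sample, "failover_on_primary_stop", "true"))
```

In practice the returned text would be written back to datastore.properties on each machine, followed by the service restart described above.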

Notes about HA Portal

  • Portal uses an internal postgres db

  • HA portal uses a primary/standby system for maintaining this

    • standby portal connects to primary portal db

  • Turning off or turning on at the same time can cause deadlock

  • https://community.esri.com/thread/189085-portal-ha-105-issue

    • Refers to 10.5 but I've yet to find anything more current

Making Portal fail

There is an order of operations that will cause the portal site to lose integrity and become unavailable. This is referenced in the GeoNet thread above. The following steps may produce such a state:

Attempt #1

Result: After a few minutes the site recovered with the same primary. No problem.

Attempt #2

Perform an Azure VM restart on both machines simultaneously.

  • After 2 minutes web adaptor gives “Could not access portal machines” message

  • After 10 minutes no response

  • After 15 minutes both portals are responding

Don’t restart both portal VMs at the same time.

Don’t restart both datastore VMs at the same time.
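One way to respect both rules is to restart the machines one at a time and wait for each to report ready before touching the next. The following is a sketch of that ordering only. `rolling_restart`, `restart` and `is_ready` are hypothetical names; the caller supplies the real restart and readiness checks, and the 15-minute default timeout is an assumption loosely based on the recovery time seen in Attempt #2.

```python
import time
from typing import Callable, Iterable


def rolling_restart(
    machines: Iterable[str],
    restart: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 900.0,   # assumption: about 15 minutes per machine
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Restart machines one at a time, waiting for each to be ready before the next."""
    for name in machines:
        restart(name)
        waited = 0.0
        while not is_ready(name):
            if waited >= timeout_s:
                # Stop here so the remaining machines are never restarted
                # while this one is still down.
                raise TimeoutError(f"{name} not ready after {timeout_s:.0f}s")
            sleep(interval_s)
            waited += interval_s
```

Raising on timeout, rather than continuing, means the second machine of a pair stays up if the first one fails to come back.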

Deployment changes

Added an internal (private IP) load balancer to the WEB VMs. Instructions are noted at the top of the page.

  • Previously, the VM hosts files pointed only to WEB0, or

  • Cloud domain controller was pointing to WEB0

  • Updated hosts entry setting jshweb.tcgis.ca to the web internal load balancer

  • On web machines the hosts entry was set to its own IP

    • Did not try loopback/127.0.0.1 but that should work too
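Because the hosts entries should be confirmed on every VM, a small check like the one below could be run against the contents of each hosts file (C:\Windows\System32\drivers\etc\hosts on Windows). It is a sketch only: `hosts_ip_for` is a hypothetical helper, and 10.0.0.10 is a placeholder for the internal load balancer IP.

```python
from typing import Optional


def hosts_ip_for(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the IP of the first hosts entry that names hostname, or None."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into IP and host names.
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and wanted in (h.lower() for h in entry[1:]):
            return entry[0]
    return None


# Hypothetical hosts content; 10.0.0.10 stands in for the internal LB IP.
sample = """
# added by deployment
10.0.0.10    jshweb.tcgis.ca
"""
print(hosts_ip_for(sample, "jshweb.tcgis.ca") == "10.0.0.10")
```

On the WEB VMs the expected value would be the machine's own IP instead of the load balancer IP.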

Confirming updated addUsersToLocalGroups

.\addUsersToLocalGroups.ps1 -egis_env stg -web_internal_lb_ip "xxx.xxx.xxx.xxx"

LOGS

C:\portalforarcgis\content\arcgisportal\logs\EGIS-JSH-POR0.CLOUD.TC.GC.CA\portal