High availability in practice
- 1 Overview
- 2 Notes
- 2.1 ARM Deploy
- 2.2 DSC Deploy
- 2.3 VM failure testing
- 2.3.1 Test 0: turn off WEB0
- 2.3.2 Test 1: Turn off POR1/Standby
- 2.3.3 Test 1a: Turn on POR1/Standby
- 2.3.4 Test 2: Turn off POR0/Primary, service only
- 2.3.5 Test 2a: Turn ON POR0 (now standby), service only
- 2.3.6 Test 3: Turn off POR0 (standby) VM
- 2.3.7 Test 3a: Turn on POR0 (standby) VM
- 2.3.8 Test 4: Turn off POR1 (primary) VM
- 2.3.9 Test 4a: Turn on POR1 (now standby) VM
- 2.3.10 Test 5: Turn off GIS0 VM
- 2.3.11 Test 5a: Turn on GIS0 VM
- 2.4 Data store fail-over
- 2.5 Notes about HA Portal
- 2.5.1 Making Portal fail
- 2.5.1.1 Attempt #1
- 2.5.1.2 Attempt #2
- 2.6 Deployment changes
- 2.7 LOGS
Overview
Resilience testing of our STG environment led to bug #3743, among others, and motivated a review of our HA implementation in practice. What follows are the notes from that testing. These notes should provide context for upcoming Pull Requests, and will be referenced when the build book (RDIMS 14981366) is updated.
Notes
ARM Deploy
NEW internal web load balancer
./addUsersToLocalGroups.ps1 now accepts internal web LB IP as parameter
Need to look this up in the Azure portal
Required to balance twinned web VMs from HOSTS or domain DNS
Confirm tcgis.ca CNAME points to load balancer, not web0
set jshweb0 to jshweb2,
set jshself to jshweb0 to ensure this
DSC Deploy
Use new parameter to addUsersToLocalGroups.ps1
.\addUsersToLocalGroups.ps1 -egis_env stg -web_internal_lb_ip "xxx.xxx.xxx.xxx"
Check that all VM hosts files are updated to point to new internal LB
Single VM deployments can point to web0 still
done by addUsersToLocalGroups.ps1, but should confirm
WEB* VMs set hosts to themselves
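The hosts-file confirmation above could be scripted roughly as follows. This is a minimal sketch and not part of addUsersToLocalGroups.ps1: the function names, the idea of passing the file content in as text, and the sample IP 10.0.0.10 (standing in for the internal LB IP) are all assumptions for illustration.

```python
def hosts_ip_for(hosts_text, hostname):
    """Return the IP mapped to hostname in hosts-file text, or None if absent."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        ip, *names = entry.split()
        if wanted in (n.lower() for n in names):
            return ip  # first matching line wins in this sketch
    return None


def hosts_entry_ok(hosts_text, hostname, expected_ip):
    """True when hosts_text maps hostname to expected_ip."""
    return hosts_ip_for(hosts_text, hostname) == expected_ip


# Hypothetical sample: 10.0.0.10 stands in for the internal web LB IP.
sample = """
# managed by deployment
10.0.0.10   jshweb.tcgis.ca   # internal web LB
"""
print(hosts_entry_ok(sample, "jshweb.tcgis.ca", "10.0.0.10"))  # prints True
```

On a WEB* VM the expected IP would be the VM's own address rather than the load balancer's.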
https://jshweb.tcgis.ca/portal/portaladmin/system/properties
Update privatePortalURL to use the web adaptor via the public URL
Jon has a task to add this to Site Config function app
{
"WebContextURL":"https://jshweb.tcgis.ca/portal",
"privatePortalURL":"https://jshweb.tcgis.ca/portal"
}
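A quick sanity check on these two properties might look like the sketch below. It only inspects a dictionary that has already been retrieved; how the properties are fetched from portaladmin is left out. The function name `check_portal_urls` and the rule "https plus the public host" are assumptions drawn from the notes above, not an Esri-defined check.

```python
import json
from urllib.parse import urlparse


def check_portal_urls(properties, public_host):
    """Return a list of problems found in the portal system properties."""
    problems = []
    for key in ("WebContextURL", "privatePortalURL"):
        value = properties.get(key)
        if not value:
            problems.append(f"{key} is not set")
            continue
        parsed = urlparse(value)
        if parsed.scheme != "https":
            problems.append(f"{key} does not use https: {value}")
        if (parsed.hostname or "").lower() != public_host.lower():
            problems.append(f"{key} does not use {public_host}: {value}")
    return problems


props = json.loads("""
{
  "WebContextURL": "https://jshweb.tcgis.ca/portal",
  "privatePortalURL": "https://jshweb.tcgis.ca/portal"
}
""")
print(check_portal_urls(props, "jshweb.tcgis.ca"))  # prints []
```

An empty list means both URLs go through the public web adaptor host; a privatePortalURL that still names an individual portal machine would be reported.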
https://jshweb.tcgis.ca/portal/portaladmin/federation/servers
Ensure URL and admin url match and use the web adaptor public URL for server
VM failure testing
Test 0: turn off WEB0
Check that federated server was valid in portal org settings: pass
Check that server relational datastore was valid: pass
Reload existing hosted feature service: pass
Delete and recreate hosted feature service: pass
Test 0.5: Turn on WEB0, turn off WEB1
Done at same time, so some temp downtime
Waited until NLB showed full availability
No noticeable difference to end user
Map existing hosted feature layer: pass
Test 1: Turn off POR1/Standby
Turned off "Portal for ArcGIS" service, not VM
Started 2:14
2:15 portaladmin loads
2:16 response "Site not ready" from https://jshweb.tcgis.ca/portal/portaladmin/machines/status/EGIS-JSH-POR1.cloud.tc.gc.ca
Check that federated server was valid in portal org settings: pass
Check that server relational datastore was valid: pass
Mapped existing layer
!!! Had not repointed privatePortalURL... doing so now and repeating the checks
!!!! While POR1 is down... fine, set when service restarted
Check that federated server was valid in portal org settings: pass
Check that server relational datastore was valid: pass
Mapped existing layer: pass
Test 1a: Turn on POR1/Standby
Reconnect to portal site
time: A few minutes
Mapped existing layer: pass
Seems a bit slow, no numbers to support
Test 2: Turn off POR0/Primary, service only
Service turned off at 8:16
EGIS responsive at 8:18
https://jshweb.tcgis.ca/portal/portaladmin/machines notes that POR1 is now primary
Check that federated server was valid in portal org settings: pass
Check that server relational datastore was valid: pass
Mapped existing layer: pass
Delete and re-create hosted feature service: pass
Test 2a: Turn ON POR0 (now standby), service only
Turned on 8:28
8:31 POR0 status is ready: https://jshweb.tcgis.ca/portal/portaladmin/machines/status/EGIS-JSH-POR0.cloud.tc.gc.ca
Test 3: Turn off POR0 (standby) VM
Turned off VM 8:59
Site available at 9:00 (about 90 seconds)
9:01 POR0 status not ready: https://jshweb.tcgis.ca/portal/portaladmin/machines/status/EGIS-JSH-POR0.cloud.tc.gc.ca
Test 3a: Turn on POR0 (standby) VM
Turned on 9:03
VM started 9:04
9:06 site status ready
Test 4: Turn off POR1 (primary) VM
Turned off 9:10
9:11 POR0 now primary (about 90s)
POR1 status not ready
Site available
Check that federated server was valid in portal org settings: pass
Check that server relational datastore was valid: pass
Mapped existing layer: pass
Delete and re-create hosted feature service: pass
Test 4a: Turn on POR1 (now standby) VM
Turned on 9:17
VM started 9:18 (~60s)
Site available throughout
POR1 status ready 9:20 (~90s)
Mapped existing layer: pass
Test 5: Turn off GIS0 VM
Turned off 10:41
stopped 10:42
Listed as stopped in server manager 10:42
Mapped existing layer: pass
Delete and re-create hosted feature service: pass
Test 5a: Turn on GIS0 VM
Turned on 10:51
Started 10:52 (~60s)
Status STARTED 10:54 (~2.4 minutes)
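The recovery times in these tests were read off a clock by hand. For future runs, a small polling helper could record them instead. The sketch below is an assumption about how that might be done: `is_ready` is a hypothetical caller-supplied function (for example one that queries the machine status URL noted above and decides whether the response means "ready"), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_ready(is_ready, timeout_s=600.0, interval_s=15.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Simulated check: ready on the third poll (no network access needed).
answers = iter([False, False, True])
elapsed = wait_until_ready(lambda: next(answers), interval_s=0.01)
print(elapsed is not None)  # prints True
```

The clock and sleep functions are injectable so the helper can be exercised without real waiting; how the status response should be interpreted is deliberately left to the caller, since the notes only record the "Site not ready" message.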
Data store fail-over
Jiehua has discovered the key to ensuring that the datastore will fail over correctly, as noted in https://dev.azure.com/TCOPP/EGIS/_workitems/edit/3830. The key portion being:
(This step, provided by Esri Inc., is key to the success of data store failover.) Manually change the data store properties file on both ds0 and ds1, C:\ArcGIS\DataStore\framework\etc\datastore.properties, setting failover_on_primary_stop=true. The default setting is false. Then restart the data store services.
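If this edit is ever automated rather than made by hand, the text transformation could be sketched as follows. The helper name `set_property` and the sample file content are assumptions; reading and writing the actual file on ds0 and ds1, and restarting the data store services afterwards, are not shown.

```python
def set_property(properties_text, key, value):
    """Return properties_text with key set to value, appending the key if missing.

    Line endings are normalized to "\n" in the returned text.
    """
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and blank lines untouched.
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical file content; the real datastore.properties has other keys.
before = "# sample\nfailover_on_primary_stop=false\n"
print(set_property(before, "failover_on_primary_stop", "true"))
```

The function is idempotent, so running it against a file that already has the value set to true leaves the setting unchanged.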
Notes about HA Portal
Portal uses an internal postgres db
HA portal uses a primary/standby system for maintaining this
standby portal connects to primary portal db
Turning off or turning on at the same time can cause deadlock
https://community.esri.com/thread/189085-portal-ha-105-issue
Refers to 10.5 but I've yet to find anything more current
Making Portal fail
There is an order of operations that will cause the portal site to lose integrity and become unavailable. This is referenced in the GeoNet thread above. The following steps may produce such a state:
Attempt #1
Note which portal is currently the primary
Stop both portal VMs simultaneously
Start the standby portal machine
Wait 15 seconds
Start the former primary portal machine
Result: After a few minutes the site recovered, same primary. No Problem
Attempt #2
Perform an Azure VM Restart on both machines simultaneously.
After 2 minutes web adaptor gives “Could not access portal machines” message
After 10 minutes no response
After 15 minutes both portals are responding
Don’t restart both portal VMs at the same time.
Don’t restart both datastore VMs at the same time
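One way to make the two rules above hard to break is to plan restarts in batches that never contain two VMs of the same role. The sketch below only does the planning; it is an assumption about how a maintenance script might be organised, and the role labels and the `restart_batches` name are hypothetical. Each batch should be fully back (site status ready) before the next one starts.

```python
from collections import defaultdict
from itertools import zip_longest


def restart_batches(vm_roles):
    """Group VMs into sequential restart batches with at most one VM per role each.

    vm_roles maps VM name -> role (e.g. "portal", "datastore").
    """
    by_role = defaultdict(list)
    for vm, role in vm_roles.items():
        by_role[role].append(vm)
    batches = []
    # Batch n holds the n-th VM of every role, so same-role VMs never share a batch.
    for group in zip_longest(*by_role.values()):
        batches.append([vm for vm in group if vm is not None])
    return batches


# VM names follow the notes; the role labels are assumptions for illustration.
vms = {"POR0": "portal", "POR1": "portal", "DS0": "datastore", "DS1": "datastore"}
print(restart_batches(vms))  # prints [['POR0', 'DS0'], ['POR1', 'DS1']]
```

Whether a portal VM and a data store VM can safely restart together was not tested in these notes; a more cautious plan would put every VM in its own batch.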
Deployment changes
Added an internal (private IP) load balancer to the WEB VMs. Instructions are noted at the top of the page.
VM Hosts was just pointing to WEB0, or
Cloud domain controller was pointing to WEB0
Updated hosts entry setting jshweb.tcgis.ca to the web internal load balancer
On web machines the hosts entry was set to its own IP
Did not try loopback/127.0.0.1 but that should work too
Confirming updated addUsersToLocalGroups
.\addUsersToLocalGroups.ps1 -egis_env stg -web_internal_lb_ip "xxx.xxx.xxx.xxx"
LOGS
C:\portalforarcgis\content\arcgisportal\logs\EGIS-JSH-POR0.CLOUD.TC.GC.CA\portal