Load testing

Overview

Load testing was carried out using the k6 package from k6.io. Output is written to an InfluxDB database, and visualizations are configured in Grafana. The tests can be viewed at https://dev.azure.com/TCOPP/_git/EGIS?path=%2FARM%2Floadtest&version=GBfeatures%2Floadtest%2Fk6, specifically egisloadtest.js. See the readme.md in that folder (https://dev.azure.com/TCOPP/_git/EGIS?path=%2FARM%2Floadtest%2Fvm%2Freadme.md&_a=preview) for details on how to run them.

With this framework in place, the performance of OIM layers and TCOMS has been evaluated.

Repeating these tests

The tests shown here are available in version control as mentioned in the overview. These tests can be run from your Windows workstation by installing the K6 Windows Binaries.

The Linux VM is only used to track and graph test results. Details on accessing or recreating this VM are contained in the readme.md file in version control.

Sandbox

Testing has been performed on the SBX instance, consisting of an Azure VM for each of the Web and Portal, GIS Server, and Data Store roles. Note that SBX is behind a firewall maintained by the cloud team, and our loadtest VM is not whitelisted to access SBX at the time of writing. All tests were therefore performed on the TC network. The results follow:

The Sandbox machines performed surprisingly well.

 

It was noticed that the SBX VMs were undersized compared to Dev, and that accelerated networking (AN) wasn’t enabled. AN is an Azure feature that was released after Sandbox was created. The VMs were resized and accelerated networking was enabled on each network interface:

# Run from PowerShell; the trailing backticks are PowerShell line continuations.
az login
az account set --subscription 0921cb70-a968-4860-a0dc-7f77ba3a35c4

# Data Store NIC
az network nic update `
  --name tc-egis-sbx-ds592 `
  --resource-group TC-Sandbox-ArcGIS-RG `
  --accelerated-networking true

# Web and Portal NIC
az network nic update `
  --name tc-egis-sbx-web95 `
  --resource-group TC-Sandbox-ArcGIS-RG `
  --accelerated-networking true

# GIS Server NIC
az network nic update `
  --name tc-egis-sbx-gis702 `
  --resource-group TC-Sandbox-ArcGIS-RG `
  --accelerated-networking true

# GeoEvent Server NIC
az network nic update `
  --name egis-sbx-geoevt44 `
  --resource-group TC-Sandbox-ArcGIS-RG `
  --accelerated-networking true

The load testing was repeated at the same time of day, although the second round was done through the VPN from a home connection. The results are as follows:

 

These results look about the same despite the improved hardware. To clear up the confusion, the tests were repeated on the TC network in the early morning. The results were more promising:

Here we see SBX handle a 200-user load with acceptable response times.

Dev

Testing has been performed on the DEV instance, consisting of an Azure VM for each of the Web and Portal, GIS Server, and Data Store roles.

Portal

Portal has been tested on the Dev instance by creating an increasing number of virtual users and logging the response time from subsequent requests to the portal home page, portal CSS, and a hosted image. The pilot instance splits the Web Adapter and Portal roles into their own VMs.
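The "increasing number of virtual users" pattern described above can be sketched as a list of k6 ramping stages. This is only an illustration: the contents of egisloadtest.js are not reproduced on this page, so the function name, step size, step duration, and maximum user count below are all assumptions.

```javascript
// Build k6-style ramping stages. Each stage raises the virtual user target
// by `stepUsers` and lasts `stepDuration`. All values here are illustrative
// assumptions, not the settings used in egisloadtest.js.
function buildRampStages(maxUsers, stepUsers, stepDuration) {
  const stages = [];
  for (let target = stepUsers; target <= maxUsers; target += stepUsers) {
    stages.push({ duration: stepDuration, target });
  }
  // Finish by ramping back down to zero virtual users.
  stages.push({ duration: stepDuration, target: 0 });
  return stages;
}

const portalStages = buildRampStages(250, 50, "10s");
console.log(portalStages);

// In a k6 script this array would be supplied through the exported options:
//   export const options = { stages: portalStages };
```

Keeping the stages in a small helper makes it easy to rerun the same ramp against Dev, Pilot, and Sandbox with only the target URLs changed.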

http://egis-loadtest.canadaeast.cloudapp.azure.com:3000/d/1-JJY8cWz/load-test?orgId=1&from=1568645278198&to=1568645322670

This test was performed from a developer laptop on the TC network. We can see that the average response time (green) remains under 2.5 seconds with a load of just over 200 simultaneous users. When another block of virtual users is added, the portal begins to fail.

TODO: Check whether the portal actually failed or whether the networking on the laptop was saturated.

When running a similar test from an Azure virtual machine, we see the following results:

http://egis-loadtest.canadaeast.cloudapp.azure.com:3000/d/1-JJY8cWz/load-test?orgId=1&from=1568656908296&to=1568656971010

This takes many, but not all, of the network considerations out of the equation (Azure Canada East calling Canada Central). We see that our Dev portal can handle over 100 users nicely, and over 200 users reasonably well.

Server

Similar tests were devised to stress the server by querying a feature service, fetching a feature, and exporting a web map. The following tests were run from the Azure load testing VM against the Dev instance.
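As a rough sketch of the first two request types, the helper below builds ArcGIS REST URLs for a feature service query and a single-feature fetch. The layer URL is a placeholder, not one of the Dev services, and the query parameters are assumptions about what a simple test might send. The export web map call is left out because it needs a web map JSON payload that is not shown on this page.

```javascript
// Hypothetical layer URL; the real Dev service URLs are not given on this page.
const sampleLayerUrl =
  "https://example.invalid/server/rest/services/SampleLayer/FeatureServer/0";

// Build request URLs for one feature layer.
function buildServerRequests(layerUrl, objectId) {
  // Ask for every feature and every field, returned as JSON.
  const query = new URLSearchParams({ where: "1=1", outFields: "*", f: "json" });
  return {
    // Query the feature service.
    queryUrl: `${layerUrl}/query?${query.toString()}`,
    // Fetch a single feature by its object ID.
    featureUrl: `${layerUrl}/${objectId}?f=json`,
  };
}

const serverRequests = buildServerRequests(sampleLayerUrl, 7);
console.log(serverRequests.queryUrl);
console.log(serverRequests.featureUrl);
```

In a k6 script each URL would be requested with http.get inside the default function, with the response time recorded per request type.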

http://egis-loadtest.canadaeast.cloudapp.azure.com:3000/d/1-JJY8cWz/load-test?orgId=1&from=1568653289948&to=1568653330206

These graphs show that the server performance starts off slow and gets worse. We’ll need to improve these numbers.

Azure Monitor

Azure Monitor showed that the CPU on the GIS machine could handle the load placed on it. Over the afternoon of testing, CPU usage never exceeded 80%.

 

Similarly, the DS machine had no CPU limitations:

 

Pilot

The same tests were run on Pilot and the results were as follows:

The tests were run from TC, then from Azure. We see that the portal has no issue handling up to 200 users. On the server side, we see a warm-up or caching phase, after which load is handled reasonably well until there are over 200 virtual users.

Differences between Dev and Pilot

The GIS VMs differ between our Dev and Pilot instances. EGIS-DEV-GIS0 uses a D4S_v3 VM size that provides a maximum of 6400 IOPS. Pilot uses a DS3_v2 VM size that provides a maximum of 12800 IOPS, twice the disk IOPS limit of Dev.

Geo Event Server

TODO

 

OIM Layers

The layers in the Emergency Management Group (EMG) on the Sandbox instance were sorted by views, and the most popular layers were inspected for performance. These layers include:

  • Flood_Impacts_to_Federal_Property

  • IMS_DEV_EVENT_PTS

  • Buoys_and_Lights

  • TC_Real_Property

  • AISTClassA_TC

The first four layers are feature layers and the AIS layer is a map service. A number of layers listed in the EMG group are hosted by third parties and were not evaluated.

A low number of users was used to get an idea of base performance per layer, and to see if any particular layers warrant a close look. The results were as follows:

A quick look shows that Flood Impacts would be a good first layer to have a look at.

Flood Impacts of Federal Real Property

The layer in question is located here: https://sbxweb.tcgis.ca/portal/home/item.html?id=459b8810542d4cc7a4e24aa65d741f3a

This layer is the result of an analysis that has been saved as a hosted layer. In our initial look it stood out as noticeably slow, for two reasons. First, the flood data consists of complex polygons: they use many points to form curved boundaries, they are multipart, and they contain holes. Second, the layer is viewable at a national scale even though the polygon features are too small to see at that scale.

As a demonstration, this layer was dissolved using the eGIS Analysis Tools to join all polygons that share the same property ID and date. The result was a layer that contained fewer than half of the features of the original (201 versus 450) yet looked the same.
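The effect of a dissolve on the feature count can be illustrated with a small grouping sketch. This is not the eGIS Analysis Tools implementation: it only groups attribute records by property ID and date and skips the geometry union entirely. The field names and records are made up for illustration.

```javascript
// Hypothetical attribute records; field names and values are illustrative only.
const floodFeatures = [
  { propertyId: "A1", date: "2019-05-01" },
  { propertyId: "A1", date: "2019-05-01" },
  { propertyId: "A1", date: "2019-05-02" },
  { propertyId: "B2", date: "2019-05-01" },
  { propertyId: "B2", date: "2019-05-01" },
];

// Group features that share the same values for every key field.
// A real dissolve would also union the geometries in each group.
function dissolveBy(features, keyFields) {
  const groups = new Map();
  for (const feature of features) {
    const key = keyFields.map((field) => feature[field]).join("|");
    if (!groups.has(key)) {
      groups.set(key, { key, mergedCount: 0 });
    }
    groups.get(key).mergedCount += 1;
  }
  return [...groups.values()];
}

const dissolvedFeatures = dissolveBy(floodFeatures, ["propertyId", "date"]);
console.log(floodFeatures.length, "features in,", dissolvedFeatures.length, "features out");
```

Fewer output features means fewer records for the service to query, serialize, and draw, which is the kind of gain the dissolved layer was meant to demonstrate.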

When the two layers were subjected to the performance tests above, the dissolved layer was noticeably faster.

This was an exercise to show that the decision of how to represent a data set can greatly affect the performance of its service. In this instance, using the simpler polygons of the Real Property data may be as appropriate as using an intersection with flood waters. Creating a related points layer to use at small scales would allow users to find affected properties visually. Although the goals of the layer are not known to us, showing only the current floodwater, or setting a target time value, may also simplify the data shown.

Feature Collections

The Emergency Management Group has a number of Feature Collection layers, including some for Alberta Fire data. In the Alberta Fire case, each layer is published per year. Feature collections appear to offer good performance for publishing a small amount of feature data.

In the Alberta Fire case, these yearly collections could be an intermediate step in gathering a data set that is published only as yearly summaries, with the intent of eventually publishing a complete data set. Alternatively, these layers could be the result of a user trying to work around the poor performance of a larger, more comprehensive layer.

Seeing a number of repeated layers that are only differentiated by some attribute is a sign that some guidance may be needed.

 

TCOMS - Integrated Web Application

TCOMS is a web application that integrates maps and lists of spatial data from the eGIS platform. The app interacts with eGIS in the following ways:

A request has been sent to the TCOMS team for feedback on performance and suggested areas for improvement.

GeoResources

This page no longer makes references to the eGIS sandbox. The main HTML document takes around 6 seconds to load. It is assumed that calls are made to the eGIS portal on the server side. It may be a better experience to load the document quickly, and make the API calls on the client side.

COP

This web map performs moderately well. The top contributors to loading time are:

  • COP - main, HTML document (2.85s)

  • dojo-lite.js - ESRI JSAPI (754ms)

  • dojo_en-us.js - Translation file (543ms)

This was inspected with Chrome DevTools. Note that the JavaScript library and translation file would be cached, and so would respond quickly on repeated requests.

Event details

The top loading contributors are:

  • Main HTML document (3.58s)

  • ArcGIS Online Hill shade (useful for a 2D map?) (635ms)

  • IMS_DEV_EVENT_PTS query (535ms)

Performance aside, the event layer should probably be filtered to the target event, as more than one event is currently shown. In fact, a single point may be sufficient, as the other details are already on the page. See this tutorial for an approach: https://developers.arcgis.com/javascript/latest/sample-code/intro-graphics/index.html
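Both options can be sketched as plain JavaScript objects. The field name EVENT_ID, the event number, and the coordinates are assumptions, since the schema of IMS_DEV_EVENT_PTS and the TCOMS page code are not shown here. The graphic follows the plain-object form used in the linked tutorial.

```javascript
// Build a filter expression for one event.
// EVENT_ID is a hypothetical field name; the real schema is not shown here.
function eventFilter(eventId) {
  if (!Number.isInteger(eventId)) {
    throw new TypeError("eventId must be an integer");
  }
  return `EVENT_ID = ${eventId}`;
}

// Describe a single point graphic instead of loading the whole event layer.
function eventPointGraphic(longitude, latitude) {
  return {
    geometry: { type: "point", longitude, latitude },
    symbol: { type: "simple-marker", color: [226, 119, 40] },
  };
}

const targetEventWhere = eventFilter(42);
const targetEventGraphic = eventPointGraphic(-75.7, 45.42);
console.log(targetEventWhere, targetEventGraphic);

// With the ArcGIS API for JavaScript, targetEventWhere would be assigned to a
// FeatureLayer's definitionExpression, while targetEventGraphic would be
// passed to new Graphic(...) and added to view.graphics.
```

The point graphic avoids the layer query altogether, which would remove the IMS_DEV_EVENT_PTS request from the loading contributors listed above.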

Suggestions

Platform

Some gains were made by resizing the VMs in Sandbox to align with the VM sizes used in other instances. Accelerated networking was also enabled on Sandbox. These sizes are defined in our ARM templates and are already available to future instances.

Disk size has an effect on disk performance. The current standard disk size is 128GB. The following table shows current disk size and speed, as well as the target size and speed for a proposed improvement:

            Disk Size (GB)    IOPS Limit    Throughput (MB/s)
Current     128               500           100
Proposed    1024              5000          200
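The size of the proposed improvement can be worked out directly from the figures above. The snippet below is only arithmetic on those published limits. It does not predict actual service response times, and the object and function names are illustrative.

```javascript
// Disk limits copied from the table above.
const currentDisk = { sizeGb: 128, iops: 500, throughputMBs: 100 };
const proposedDisk = { sizeGb: 1024, iops: 5000, throughputMBs: 200 };

// Express the proposed disk as a multiple of the current one.
function diskGain(current, proposed) {
  return {
    sizeFactor: proposed.sizeGb / current.sizeGb,
    iopsFactor: proposed.iops / current.iops,
    throughputFactor: proposed.throughputMBs / current.throughputMBs,
  };
}

const gain = diskGain(currentDisk, proposedDisk);
console.log(gain); // { sizeFactor: 8, iopsFactor: 10, throughputFactor: 2 }
```

An eightfold increase in size buys a tenfold increase in the IOPS limit but only doubles throughput, so workloads dominated by many small reads should benefit most.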

Case study: DS Disk Size

The data store machine on sandbox had its disk resized from 128GB to 1024GB. OIM layer performance was tested before and after this change and the results are shown in the following graph:

We can see that, if nothing else, consistency has improved greatly. Average response times have dropped from over a second to around 500ms.

ArcGIS Enterprise Configuration

It would be worth evaluating the ArcGIS Server’s cache control, as mentioned here:

https://enterprise.arcgis.com/en/server/latest/publish-services/windows/improve-map-service-display-performance.htm

 

Publishing data

These suggestions have been migrated to How to publish spatial data to eGIS

  • Best practices for publishing and symbolizing data must be followed regardless of our hardware.

    • Create map caches where possible

    • Data that can’t be cached should be stored and displayed in the fewest and simplest features possible

    • Follow ESRI’s Performance tips for uncached maps

  • Re-project all data to WGS84 Web Mercator

    • Avoid all re-projection on the fly

    • Debate?

  • Test performance on all layers?

    • Web hooks, test new layers?

Also, when reports of poor performance are made, create a test using this framework, and track the improvement as different options are considered.
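Tracking an improvement comes down to comparing the same summary statistics before and after a change. The sketch below compares the mean and 95th percentile of two runs. The sample response times are made up for illustration and are not measurements from any eGIS instance. The function names are also assumptions.

```javascript
// Summarize a list of response times in milliseconds.
function summarize(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, value) => sum + value, 0) / sorted.length;
  // Nearest-rank 95th percentile.
  const p95 = sorted[Math.ceil(0.95 * sorted.length) - 1];
  return { mean, p95 };
}

// Compare two runs and report the change in mean response time as a percentage.
function compareRuns(beforeMs, afterMs) {
  const before = summarize(beforeMs);
  const after = summarize(afterMs);
  const meanChangePct = ((after.mean - before.mean) / before.mean) * 100;
  return { before, after, meanChangePct };
}

// Made-up sample data, for illustration only.
const runComparison = compareRuns(
  [800, 1000, 1200, 900, 1100],
  [400, 500, 600, 450, 550]
);
console.log(runComparison);
```

Reporting a percentile alongside the mean helps show whether a change improved consistency as well as the average.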