Technical considerations of processing client data


Overview

Scripting of the TC SmartMaps data is progressing, and it's raising questions about how this will work in practice and how to generalize the approach for other clients.

Currently, data is downloaded from the source directly by our Python scripts to a temporary folder, or processed by Lawrence and placed in a network folder. Scripts are run from our workstations, and no scheduling is in place.

Data conventions

Data should be made available to us in, or converted to, the following format (a validation sketch follows this list):

  • Data format

    • A zip file containing one FGDB, which contains one feature class and at most one table

  • Naming convention

    • All names match the zip file name

    • No dash or slash characters
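
Below is a minimal sketch of how we could validate an incoming zip against these conventions before processing. It is an illustration only: the function name is made up, and it covers just the zip-level and naming rules, since counting the feature classes and tables inside the FGDB would still need ArcPy or GDAL.

    import re
    import zipfile
    from pathlib import Path

    def validate_delivery(zip_path):
        """Check an incoming delivery against the zip/FGDB naming conventions above."""
        zip_path = Path(zip_path)
        name = zip_path.stem
        errors = []

        # Names must not contain dash or slash characters.
        if re.search(r"[-/\\]", name):
            errors.append(f"Zip name contains dash/slash characters: {name}")

        with zipfile.ZipFile(zip_path) as zf:
            # Expect exactly one top-level FGDB folder inside the zip.
            gdb_roots = {p.split("/")[0] for p in zf.namelist()
                         if p.split("/")[0].lower().endswith(".gdb")}
            if len(gdb_roots) != 1:
                errors.append(f"Expected exactly one FGDB, found: {sorted(gdb_roots)}")
            elif gdb_roots.pop() != f"{name}.gdb":
                errors.append(f"FGDB name does not match zip name {name}")

        return errors

Any returned errors would mean the delivery should be rejected or sent back to the maintainer for renaming.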

Client responsibility

If the client is authoring or collecting new data, then they should maintain that data directly through an application, or by connecting ArcGIS Pro to the EGIS platform. As part of our onboarding process we'll need to determine and facilitate this data maintenance.

If the client simply requires up-to-date third-party data, then we'll work through the eBIDA data onboarding process (reference?). If there are existing ETLs, we'll work with the client to either take responsibility for those ETLs or provide a method to output to EGIS in a similar manner to the TC SmartMaps process.

Where does source data sit?

If we're to run our scripts from the cloud then we'll need to access the source data in the cloud.

  • Data could be loaded into blob storage

    • FME has an AzureBlobUpload, so Lawrence or other clients could write blobs directly

  • Can we use the EGIS platform as our source data store?

    • Research would be required to work out a clean approach

    • It has an API, authentication, and some FME support (a connection sketch follows this list)
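
If EGIS does turn out to be a workable source store, access would presumably go through its API. Assuming EGIS is an ArcGIS Enterprise portal reachable with the ArcGIS API for Python (the URL, credentials, and item title below are placeholders), a connection sketch could look like:

    from arcgis.gis import GIS

    # Placeholder portal URL and service credentials.
    gis = GIS("https://egis.example.com/portal", "svc_automation", "password")

    # Search for the source item a client maintains; title and type are illustrative.
    for item in gis.content.search("title:SmartMaps Parcels", item_type="Feature Layer"):
        # item.modified is an epoch timestamp (ms) we could compare against our own
        # metadata to decide whether a refresh is needed.
        print(item.title, item.modified)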

Jon notes that a storage account per client could be useful for breaking down billing. I'd add that once the data is uploaded to EGIS, it's all in one place.

Blob storage will require that our data maintainers have Azure accounts and write access to their blobs. The overhead here decreases as TC moves towards joining the cloud and on-prem networks.
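
As a sketch of the blob-storage option, a delivery could be pushed with the azure-storage-blob Python package. The connection string, container name, and file name below are assumptions, not agreed conventions:

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string; per Jon's note, this could be a per-client storage account.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("tc-smartmaps")

    zip_name = "SmartMapsParcels.zip"
    with open(zip_name, "rb") as data:
        # Overwrite the previous delivery so the container always holds the latest zip.
        container.upload_blob(name=zip_name, data=data, overwrite=True)

Lawrence or other maintainers could do the equivalent from FME rather than Python.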

Where are scripts run?

Python scripts could be run from one of our GIS Servers. See Task 2049, Investigate running automation scripts on ArcGIS Server.

Based on a cloud-first adoption strategy (reference?), we should consider running our maintenance scripts using Azure Functions. We would need to confirm whether ArcPy can be used in that context; I don't believe that it can. We would then need to research whether we can push enough of the business-rule pre-processing to FME to remove our ArcPy dependency.
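
If Azure Functions turn out to be viable, a nightly timer trigger is roughly what the entry point would look like. This is a sketch using the Python v2 programming model with an illustrative schedule and function name; it deliberately avoids ArcPy, since whether ArcPy can run in a Function is exactly the open question:

    import logging
    import azure.functions as func

    app = func.FunctionApp()

    # Run nightly at 02:00 UTC; the schedule and body are illustrative only.
    @app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer", run_on_startup=False)
    def nightly_refresh(timer: func.TimerRequest) -> None:
        logging.info("Checking client datasets for stale metadata")
        # The actual refresh would be delegated to FME or plain-Python processing,
        # e.g. a hypothetical refresh_stale_datasets() helper.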

We could consider FME Server to ensure that pre-processing can be run through services and is available programmatically at any time.

When are scripts run?

We will need to trigger these updates. This could be done through a scheduled task or via an event trigger. We're already checking for metadata freshness by comparing against the online metadata where available.

One possible workflow could look like the following (sketched in code after the list):

  • Check metadata for updates nightly

  • If the data is stale, attempt to refresh it

  • If it can't be refreshed, contact the maintainer
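
In code, that flow might look something like the sketch below. Every helper here (local_metadata_date, remote_metadata_date, refresh, notify_maintainer) is hypothetical and stands in for logic we already have or still need to build:

    import logging

    def nightly_check(datasets):
        """Proposed nightly flow: compare metadata dates, refresh stale data,
        and fall back to contacting the maintainer."""
        for ds in datasets:
            local_date = ds.local_metadata_date()    # date recorded at our last load
            remote_date = ds.remote_metadata_date()  # online metadata, where available
            if remote_date is None or remote_date <= local_date:
                continue  # still fresh, nothing to do
            try:
                ds.refresh()                          # download / FME / load to EGIS
            except Exception:
                logging.exception("Refresh failed for %s", ds.name)
                ds.notify_maintainer("Automatic refresh failed; manual update needed")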

Research topics

  • Can we run ArcPy from Azure Functions?

  • Can we use FME and remove the ArcPy dependency from our Python scripts?

  • Does FME server make sense in our workflow?

  • Determine data freshness based on our metadata vs live metadata

  • Determine how to remove deprecated data