Azure Data Lake with ArcGIS Enterprise

 

Overview

This page records initial research into connection points between ArcGIS Enterprise and Azure Data Lake (ADL). A data lake has been created for researching the topic; it is detailed in Creating the dev instance.

PowerPoint

See this file for retro demo: Azure Data Lake.pptx

Azure Data Lake

From the Azure website:

Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so you can extend current data applications. We’ve drawn on the experience of working with enterprise customers and running some of the largest scale processing and analytics in the world for Microsoft businesses like Office 365, Xbox Live, Azure, Windows, Bing, and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that’s ready to meet your current and future business needs.

According to Wikipedia, ADL supports any application that uses HDFS, and it uses Apache YARN for analytics.
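In practice, HDFS compatibility means a Hadoop-style client addresses the lake by URI. The sketch below builds the two URI forms Azure uses: `adl://` for Data Lake Storage Gen1 and `abfss://` for Gen2. Which generation the dev instance uses is not stated on this page, and the account, file system, and path names are hypothetical placeholders.

```python
def adl_gen1_uri(account: str, path: str) -> str:
    """HDFS-style URI for Azure Data Lake Storage Gen1."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


def adls_gen2_uri(account: str, filesystem: str, path: str) -> str:
    """HDFS-style URI for Azure Data Lake Storage Gen2 (ABFS driver over TLS)."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
print(adl_gen1_uri("exampledatalake", "/ais/2018/positions.csv"))
print(adls_gen2_uri("exampledatalake", "raw", "/ais/2018/positions.csv"))
```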

ADL and TC

From conversations with the enterprise Business Intelligence and Data Analytics (eBIDA) team, TC will be moving towards an Azure Data Lake to house much of TC's data. (Link to meeting recap?)

EGIS will need to interact with this data lake in a few ways:

  1. Discover the required data within that data lake

  2. Use it as a back end for GeoAnalytics Server

  3. More?

Discovery

Azure Data Catalog

Azure Data Catalog lets you register data sets. It is not standards compliant as far as I can see, but it has a REST API. Data Catalogs aren't available in Canadian regions. This may not be a problem (the data stays resident in Canada, with only the catalog/metadata held externally), but I couldn't create one because of TC policy.

In emails with MS on the topic, the data catalog has been described as:

[Data Catalog] is about data source discovery, enterprise data systems discovery and enabling value to your business departments and users. Data Catalog allows people in your teams and departments to discover and leverage existing data sources and repositories that each department has created. Data Catalog shows Schema information, technical details, and various other data profiling tools on your data sources.

Azure Search

Azure Search looks like a better way to discover data in a dump-everything-in-the-lake situation. It supports Lucene query syntax (which is what we'd get from Voyager Search) as well as its own API. I'm not sure whether we would wire this search in directly. There's a tutorial on how to search unstructured data.
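As a rough illustration of the Lucene option, the sketch below builds (but does not send) a Search Documents request with `queryType` set to `full`, which switches Azure Search from its default simple syntax to full Lucene syntax. The service name, index name, field names, and key are hypothetical placeholders, and the API version is an assumption that should be checked against whatever the service supports.

```python
import json
import urllib.request

# Hypothetical placeholders; none of these come from the dev instance.
SERVICE_NAME = "example-search-service"
INDEX_NAME = "datalake-files"
API_VERSION = "2019-05-06"  # assumed; use a version the service supports


def build_search_request(query: str, api_key: str, top: int = 10) -> urllib.request.Request:
    """Build a POST request that runs a full Lucene query against an index."""
    url = (
        f"https://{SERVICE_NAME}.search.windows.net"
        f"/indexes/{INDEX_NAME}/docs/search?api-version={API_VERSION}"
    )
    body = {
        "search": query,
        "queryType": "full",  # full Lucene syntax instead of the simple syntax
        "top": top,           # maximum number of results to return
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Fielded, wildcard, and fuzzy terms; "name" and "content" are assumed index fields.
request = build_search_request("name:ais* AND content:vessel~1", api_key="<query-key>")
# urllib.request.urlopen(request) would send it; omitted because the service is hypothetical.
```

Building the request separately from sending it keeps the query shape easy to inspect, which may help when deciding whether to wire this search in directly.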

In emails with MS on the topic, Azure Search was described as:

[Azure Search + Cognitive search] is enabling data discovery within your data itself, surfacing relevant information up. For example, when a user searches “Audi A4”, PDFs, Images, or textual information on potential faults, recalls, and information about the Audi A4 in the Cars Database that you have attached Azure Search to, will be surfaced up for the user. I suspect this is what you want as the main use case here is attaching this service to your data lake, which will allow you to discover data in the lake this way.

I've attempted to create an Azure Search service to find files in our data lake, but ran into compatibility issues.

Related

When an ESRI big data file share is created, "a manifest is generated that outlines the format of the datasets within your share location" (ref). That manifest is a Big Data Catalog Service.

GeoAnalytics

From the ESRI website:

Part of the Esri Geospatial Cloud, ArcGIS GeoAnalytics Server extends the capabilities of ArcGIS Enterprise, providing a diverse collection of analysis tools that enable you to quickly analyze your data across space and time. Data that was previously too big or too complex to analyze can now be deeply examined, understood and used to take action. Leverage the power of multiple servers to get your work done.

Support for HDFS and ADL is baked in.

The advertised workflow shows inputs from GIS and big data sources running through an analysis tool to create hosted feature layer outputs. In ArcGIS Enterprise 10.7, those outputs can also be written back to the big data file share.

GeoAnalytics Server requires a license that, as of writing, we haven't applied to Dev or SBX.

Obvious use cases

  • Historical AIS data

Questions

  • Is our GeoAnalytics BDS the same as the TC ADL?

    • Preferably...

  • Does a Big Data Catalog Service stay up to date when new datasets are added?

  • Is the speed of an Azure Data Lake sufficient for our needs?

    • Azure SA/File shares didn't work for our initial purposes

    • With Big Data analytics we want to keep the data and processing in the cloud, and display results only

Notes

  • ESRI Big Data File Share, using Hadoop

  • ESRI GeoAnalytics Server

    • requires relational and spatiotemporal data stores (these exist in Dev and SBX, but I assume they are not pointing to Azure)

  • ESRI GeoNet has little to no content on Hadoop, big data, data lakes, etc.

Sample Data

Public data exists in a number of places.

Glossary

  • Hadoop: a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (website)

  • HDFS: Hadoop Distributed File System (description)