Data Export Service alternative using Dataverse Web API and Azure Data Factory

Earlier in the year I was tasked by my client to explore possible alternatives to the soon to be depreciated Dynamics 365 Data Export Service (DES). In this article I have described some of my findings and the high level design of the solution proposed and implemented. The client was using the DES to replicate data from Dataverse to a SQL database to fulfil their corporate reporting needs via a set of Power BI reports and dashboards.

The replacement needed to fulfill the following key aspects:

Match the existing DES solution as close as possible in terms of architecture, technology and performance.
Keep the changes to the Power BI reporting to a minimum.
Keep the costs low considering the fact that DES is virtually free add-on service.
Have a facility to track and synchronise meta-data changes between Dataverse and the reporting database

De-facto alternative — Synapse Link

As per Microsoft documentation, it appears that the de-facto, ready-made alternative to DES is the Synapse Link Service (https://powerapps.microsoft.com/en-us/blog/do-more-with-data-from-data-export-service-to-azure-synapse-link-for-dataverse/). The following are some of the pros and cons that was observed after evaluating the Synapse Link Service taking in to consideration the clients requirements stated earlier.

Pros:-

No development work required and can be configured and put into production immediately
Has all the features of DES including handing of meta-data changes
Virtually no changes would be required on the Power BI reporting

Cons:-

Synapse Link Service only maintains a near real time data sync between Dataverse and a Azure Storage Account.
Additional Azure Data Factory (ADF) pipeline is required to push data to a SQL database although this is readily available via a ADF pipeline template on the pipeline gallery (https://learn.microsoft.com/en-us/power-apps/maker/data-platform/azure-synapse-link-pipelines?tabs=synapse-analytics).
There is a considerable latency between having the data synced to the Storage Account and the subsequent push to the SQL database via the ADF pipeline, at times reaching 1 hour.
To take full advantage of this solution, a Synapse Workspace is required with the Power BI reporting sourced from this workspace instead of the SQL database. This could require considerable amount of changes to the Power BI reports.
This solution is likely to cost more than any other alternative due to the potential need of a Synapse Workspace and the additional associated compute and storage requirements.

Custom built ADF alternative

There is apparently no like-for-like DES replacement available from Microsoft. Of course there maybe other third party vendors providing packaged solutions in the market, however for this client we only explored options utilising products and services available on the Microsoft data integration platform.

From the outset the clients preference was to build a solution with the same topology as DES where the data is directly synced between Dataverse and SQL database without having any intermediary stages such as a storage account as this would help to achieve similar performance levels and also keep the costs low. As the client was already using Azure Data Factory as the main data integration tool throughout the organisation in a number of projects it would be the ideal tool move the data between Dataverse and Azure SQL database. With regards to sourcing the data, ADF could either use the Dataverse connector that came out of the box and use FetchXML to query the tables for delta changes or the same could be achieved with the Dataverse Web API which is a REST API based on the OData protocol. After evaluating both these technologies, the following is a list of pros and cons of each that I observed.

FetchXML:-

Easier to write queries as its similar to regular SQL
Performs relatively well
There is a apparent limitation of 5000 rows that can be fetched at a time and this may require special handling within ADF
Columns with a large number of null values may get dropped if they are not within FetchXML sample set
Only a limited set of meta-data is exposed to FetchXML and may not be possible to easily build a meta-data maintenance pipeline to track and handle meta-data changes between Dataverse and SQL database using ADF

Dataverse Web API:-

Queries need to be constructed as part of the URL
Performance similar to FetchXML
There is a limitation of 5000 rows per page, but this can be handled via OData paging rules within ADF REST API source connector without the need to build any additional looping
Columns with null values does not get dropped
Able to retrieve all meta-data relating to entities and attributes thereby making it possible to build a metadata maintenance pipeline to handle meta-data changes between Dataverse and SQL database using ADF
Also able to retrieve both OptionSetMetadata and GlobalOptionSetMetadata

Having evaluated these options, my recommendation would be to use the Web API to source the data from Dataverse as it is the more versatile method to extract both data and meta-data.

High Level Design

The proposed solution consists of the two ADF pipelines; one to track and update meta-data changes between Dataverse model and the SQL database to and another to copy the data from Dataverse to SQL database. The first pipeline is the metadata maintenance pipeline which collects and maintains all the metadata to drive the data load pipeline. Although it is possible to run both pipelines together, for the best performance and operational efficiency it is recommended to separate the two and run the metadata pipeline in a controlled manner during a maintenance window where any changes made to the Dataverse model is announced in advance. This also gives a chance for the Reporting side to make any additional changes if required.

The following diagram depicts the high level design of the metadata maintenance pipeline:

In additional to extracting data, the Dataverse Web API can also be used to extract metadata pertaining to the entities of a Dyanamics application. More information regarding these can be found at these links:

As shown in the above diagram, an ADF pipeline extracts all entity metadata from the EntityDefinisions API end point and stores them in a table named EntityMetadata in the SQL database. The list of entities in this table is then used in a loop activity to iteratively extract their attribute level metadata which is then stored in a table named AttributeMetadata. At the same time all option set terms pertaining to any choice attributes gets stored in another table named OptionSetMetadata. A dynamic SQL DDL script then compares the collected metadata with the corresponding schema information on the SQL database to detect any entity and attribute level differences. Where differences are found the script dynamically builds the required DDL statements to add tables, add columns and/or alter columns to bring the SQL schema be in sync with the Dataverse model. Note that the script does not drops any tables or columns which it detects have been removed on the Dataverse model. Additionally the pipeline also extracts the global option set terms from the GlobalOptionSetDefinisions API end point and stores them in a separate table named GlobalOptionSetMetadata in the SQL database.

The following diagram depicts the high level design of the data load pipeline:

As shown in the above diagram, a second ADF pipeline uses the metadata collected and stored from the first pipeline to iteratively extract data from the corresponding API endpoint for each entity and load them to the respective table in the SQL database. The API request for data consists of a filtering criteria that can extract only the delta records from the respective entity. The filter gets dynamically constructed from metadata held within the EntityMetadata table pertaining to its delta marker attribute, and its most recent maximum value which typically is the ‘modifiedon’ datetime value that gets updated post a successful extract and load.

Git Repo

The SQL scripts supporting this solution has been placed in the following github repository: https://github.com/rarpal/DynamicsWebAPIToSQL.git

A working prototype of the solution described here was successfully implemented for one of my clients. It could to be adapted to meet specific requirements of other organisations and clients. Please get in touch if you would like to know more details.

Real time Train Operator Performance Analysis with Azure Event Hubs and Databricks Structured Streaming

Network Rail is the owner and infrastructure manager for most of the rail network in UK. They have been operating a real time data feed for quite a long time, but the feed was pretty much limited to the train operating companies (TOC's). In recent years however they have also made available a public feed of this data inline with the UK government's Open Data initiatives.

The public feed is available over several protocols which includes a modern HTTP based REST service as well as the much older STOMP base messaging service which is still the one widely used by much of the TOC's themselves.

I have been playing around with data streaming for a while now, but its all been the usual suspects like your social media feeds such as Twitter, Facebook, Instagram, ext, ext.. I was looking for something more interesting and remembered the network rail real time feeds. After looking more closely at the options available I decided to use the STOMP based feed as I was a bit more familiar with this datasets having previously worked with them. Anyone can gain access to this data by first registering on the Network Rail data feed website. There is however a limit to the number of users they register which is revised frequently based on demand and capacity. I have listed all the relevant links to their sites within this article, so you can also check it out if you are interested.

My main goal here is to document and demonstrate the work involved in gaining access to the live feed and establishing a data streaming service utilising Azure Event Hubs and Spark Structured Streaming on Databricks. In order to crystallise this I have depicted the basic tech stack architecture in the diagram below:

As seen in the above diagram, Network Rail real-time data feed is sourced from an internal enterprise system named TRUST which is used for monitoring the progress of trains and tracking delays on the rail network. There are a number of datasets available from TRUST which you have to select when you register for the data feeds. For this piece of work I have chosen the Train Movements dataset. This dataset consists of a number of topics exposed via an ActiveMQ messaging service that registered users can subscribe to using a client app over the Stomp protocol. Once subscribed, you receive JSON messages that can be sent to an Azure Event Hub in small batches where they can be consumed by Spark Structured Streaming data frames running on Azure Databricks. The train movement data can be put to good use in a large number of monitoring, alerting and analytical applications. In this instance I have decided to use the data to produce a simple streaming dataset that provides for a specific train company information about the number of trains that were on time, late or even early within a 10 minute rolling time window. Of course you could use the data for far more complex and more useful applications, but the objective here is to understand and highlight the basic technicalities involved in doing such a project.

Getting the Source Data

There is a good deal of information about the Network Rail data online if you are interested in the subject of rail operations in the UK. Getting access to the real time data feed is done by first registering on their site which is https://datafeeds.networkrail.co.uk/ntrod/login. Once registered you can select the datasets you are interested in.

Information about the registration process can be found here:

https://www.networkrail.co.uk/who-we-are/transparency-and-ethics/transparency/open-data-feeds/,

Information about the datasets and topics can be found in their Wiki pages here:

https://wiki.openraildata.com/index.php/Main_Page

The Client App

The client app itself can be a simple background job running on an on-premise or cloud server and can be written in any language that supports the Stomp protocol either natively or via third party libraries. For this purpose I have written a small console app in Python using the open source Python Stomp library available from the Python library index. In addition this app also uses the Python Azure Event Hub library in order to send the data received from the real time feed to the Azure Event Hub.

Below is the Python code for this client app in its most simplistic form:

import stomp
from azure.eventhub import EventHubProducerClient, EventData

NETWORK_RAIL_AUTH = ("<network rail login name>", "<password>")

class Listener(object):

    def __init__(self, mq, pr, eb):
        self._mq = mq
        self._pr = pr
        self._eb = eb

    def on_message(self, headers, message):

        print(message)

        try:            
            self._eb.add(EventData(message))

        except ValueError:
            self._pr.send_batch(self._eb)
            self._eb = self._pr.create_batch()

        self._mq.ack(id=headers['message-id'], subscription=headers['subscription'])

def main():
	producer = EventHubProducerClient.from_connection_string(conn_str="<event hub namespace connection string>", eventhub_name="<event hub name")
	event_data_batch = producer.create_batch()

	mq = stomp.Connection(host_and_ports=[('datafeeds.networkrail.co.uk', 61618)],
						keepalive=True,
						vhost='datafeeds.networkrail.co.uk',
						heartbeats=(100000, 50000))
	mq.set_listener('', Listener(mq, producer, event_data_batch))
	mq.connect(username=NETWORK_RAIL_AUTH[0],
				passcode=NETWORK_RAIL_AUTH[1],
				wait=True)

	mq.subscribe('/topic/TRAIN_MVT_ALL_TOC', 'tmat-1', ack='client-individual')
	input('press any key to stop')

	mq.unsubscribe('tmat-1')
	mq.disconnect()
	producer.close()
	print('connection closed')

if __name__ == "__main__":
    main()

The credentials in line 4 should be replaced with the login name and password that was used to register for the live feeds.

The <event hub namespace connection string> in line 27 should be replaced with the correct connection string for the Azure Event Hub Namespace. Also the <event hub name> should be replaced with the name of the Event Hub Instance. These details are available on the Azure portal once the event hub has been provisioned on the relevant Azure subscription. I have further explanations regarding the event hubs in the following sections of this article.

In the above code, lines 1 and 2 brings in the Stomp library and Azure Event Hub library which are the only external Python libraries required for this basic app.

In line 6 we setup a Listener class which has the job of listening to the messages arriving from the subscribed live feed. This class takes in as parameters a Stomp connection object, an Azure Event Hub producer object, and an Azure Event Hub data batch object. At the heart of this class is the 'on_message' method in line 13 which gets called by Stomp listener with each message that it receives. It then attempts to add those messages to the event hub data batch object in line 18. This continues to happen until the total size of all the received messages exceed the size limit of the event hub data batch at which point it enters the exception defined in line 20. The exception handler then sends the event hub data batch object containing the data so far collected to the Azure Event Hub in line 21. Line 22 then replenishes the data batch object with a new instance to resume the process of collecting new messages again.

In the main function, lines 27 and 28 sets up the connection to the Azure Event Hub from the information obtained from the provisioned resources. Line 30 makes the connection to the message queue over Stomp. Note the parameters being passed here pertain to the host ActiveMQ server at Network Rail along with the relevant port enabled for the connection. This information can be obtained from the Wiki pages given earlier. Additionally the heartbeat parameter is optional but is encouraged to make the connection more efficient and fault tolerant. More information about this along with the documentation on the Python Stomp library is available on this link: https://github.com/jasonrbriggs/stomp.py

An instance of the Listener class defined earlier gets assigned to the Stomp connection in line 34 before the actual connection is made in line 35 using the credentials provided at the start in line 4.

The listener starts in line 39, when a subscribe is made on the Stomp connection to the message queue. Notice that the first string parameter is the path to topic which in this case is TRAIN_MVT_ALL_TOC. This is the codified topic name for 'Train Movements -All TOC's' dataset available on the Network Rail data feeds. If another dataset is required, then further subscribes could be made by finding out their codified topic names from the Network Rail Open Data Wiki page links given earlier. Since it is possible to have multiple subscribes on the same Stomp connection, the second string parameter takes a unique identifier that can be used to reference them in subsequent actions such as unsubscribes as can be seen in line 42.

This simple client app runs as a console application and once started will continue to run printing each message as they are received until the user terminated the app by hitting any key. Note that this app is written this way for simplicity and ease of explanations and is not a production ready app. Typically this would be run as a background service job with additional logging, exception handling, fault tolerance and performance characteristics built in.

For demonstration purposes I am simply printing the received messages to the console in line 15 before sending them to the event hub. The clip below shows how this app is receiving data from the Network Rail real time feed in a VS Code terminal.

Azure Event Hub

Azure Event Hubs are based on the same underlying platform as Azure Service Bus. However event hubs are geared more towards ingesting and handling big data streams. Hence this is the perfect service to ingest the real time feeds from Network Rail. All that is needed is to provision one on a Azure subscription. This is a fairly straight forward process which could be done via a fully automated DevOps process, or for demonstration and learning purposes simply creating one on the Azure portal.

For details on how to provision an Azure Event Hub, see the link below:

https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-create

Once the event hub and instance have been created, the Primary Key Connection Strings of both the event hub namespace and the event hub instance should be noted along with the name of the instance. These pieces of information can be easily obtained from the Azure portal under the provisioned event hub. They are needed when writing and/or configuring the client app that sends data to the event hub and the Databricks notebook that consumes the data streams ingested from the event hub.

Azure Databricks

Azure Databricks is essentially Spark provisioned on the Azure platform. Spark offers Structured Streaming based on dataframes which is ideal for the performance analytics that we need to build here. Similar to Azure Event Hubs, provisioning an Azure Databricks cluster can be done in a automated manner using DevOps or by simply using the Azure Portal.

For details on Azure Databricks and how to provision clusters can be found on the link below:

https://docs.microsoft.com/en-us/azure/databricks/clusters/

Databricks notebook for real time TOC performance analysis

For the purpose of conducting this simple performance analysis, I set up a Databricks notebook on the portal under an existing workspace.

See the link below for more details on how to setup workspaces and notebooks:

https://docs.microsoft.com/en-us/azure/databricks/workspace/

Databricks notebooks supports Python, Scala, SQL and R languages. I have used Python for this notebook along with the build in PySpark library. Here we can consume the data being ingested into the Azure Event Hub using a Structured Streaming dataframe.

Before being able to consume data from the event hub, we need to install the Apache Spark Event Hub Connector library package on to the cluster provisioned for the job. To do this on the portal launch into the Databricks workspace and go to the Compute tab where you can see the clusters. Make sure the cluster is up and running, then click on 'Install New' as seen below:

Chose Marven as the Library Source and provide the Coordinates to the package as seen below:

Note it is important to match the package version to the version of the Databricks cluster. To help with this there is a handy table put together by the good folks who created this package which can be found in the link below:

https://github.com/Azure/azure-event-hubs-spark

I have highlighted below the version I have selected from the table below, based on the version of the cluster that I provisioned:

You can get the exact coordinates for the package by clicking on the Package Version above which takes you to the Marven Central Repository. Once the package is installed it will appear listed under the Libraries tab on the cluster page.

With the required library package installed, its time to take a look at the notebook in more detail. Before diving in though its good to get to know the train movement data a bit better. These are essentially the real time event records that reports on the progress of trains along their journey. They include every arrival and departure at a station stop, along with the corresponding timestamps and a variation status that indicate if the event occurred on time, late or early. In addition to this, there are also many other pieces of data that gets embedded in those records providing information on the train operator, location, service, scheduled times, ext., ext.. However for the purpose of this work, only the train operator id, actual timestamp and variation status is needed. From the train operator id, it is possible to obtain the train operator name by referencing a static list in the form of a simple delimited text file which can be uploaded to the Azure storage beforehand.

The table below lists the data that is used for this simple performance analysis:

Train Movement Feed
division_code	Operating company ID as per TOC Codes
actual_timestamp	The date and time that this event happened at the location
variation_status	One of "ON TIME", "EARLY", "LATE" or "OFF ROUTE"

Static Reference for TOC names
Sectorcode	Operating company ID as per TOC Codes
Description	Name of operating company

For the purpose of this article and for simplicity, the analysis has been limited to a single TOC. Hence only the single record pertaining to that TOC has been inserted into the static reference file. If other TOC's needs to be included in the analysis then its simply a matter of inserting the records for those in the reference file.

In the first cell of the notebook we configure the Spark Context to allow access to the Azure Storage Account so that it can retrieve the TOC codes and names from the static reference file. We also setup a variable to hold the connection string to the Azure Event Hub Instance. Note that this connection string is not the same as the one used earlier in the Client App to send data to the event hub. That was the connection string to the event hub namespace whereas this is the connection string to the event hub instance. The code for this cell is shown below:

from pyspark.sql.types import *
import  pyspark.sql.functions as F

# Set access key credentials to the Azure Data Lake Gen 2 at the Spark Context
spark.conf.set(
  "fs.azure.account.key.<azure storage account name>.dfs.core.windows.net",
  "<azure storage account access key>")

# Connection string to the Event Hub Instance
conn_str = "<event hub instance connection string>"
# Initialize event hub config dictionary with connectionString
ehConf = {}
ehConf['eventhubs.connectionString'] = conn_str
# Add consumer group to the ehConf dictionary
ehConf['eventhubs.consumerGroup'] = "$Default" 
# Encrypt ehConf connectionString property
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)

First the dependant libraries are imported into the notebook in line 1 and 2.

The string values in line 6 and 7 should be replaced with the Account Name and Access Key to the Azure Storage Account where the static reference file is stored. The Access Key can be obtained from the Azure portal under the provisioned storage account.

The string value in line 10 should be replaced with the connection string to the Azure Event Hub Instance. As mentioned earlier this information can be obtained from the Azure portal under the provisioned event hub instance

In the next cell we can load both the streaming data and static reference data into dataframes:

1
2

dfstream = spark.readStream.format("eventhubs").options(**ehConf).load()
dfref = spark.read.option("delimiter","\t").csv("abfss://refdata@<storage account name>.dfs.core.windows.net/sectorcodes.csv", header="true")

Line 1 loads the real time train movements data feed from the event hub to a streaming dataframe using the configuration information setup in the dictionary variable in cell 1. Line 2 loads the static reference file containing the TOC codes and names into a static dataframe. Note that this is a tab delimited text csv file named 'sectorcodes.csv' stored in a container named 'refdata' in a Azure Storage Account. The string <storage account name> should be replaced with the respective name of the provisioned storage account.

Next we define the schema that maps to the JSON structure of the messages received from the Network Rail data feed. This is a crucial piece to get right without which the data would not be usable. Fortunately the Network Rail data feed messages are not hugely complex and can be handled with the schema shown in the following cell:

sch = ArrayType(
  StructType([
      StructField("header",StructType([
          StructField("msg_queue_timestamp",StringType(),True),
          StructField("msg_type",StringType(),True),
          StructField("original_data_source",StringType(),True),
          StructField("source_dev_id",StringType(),True),
          StructField("source_system_id",StringType(),True),
          StructField("user_id",StringType(),True)
      ])),
      StructField("body",StructType([
          StructField("actual_timestamp",StringType(),True),
          StructField("auto_expected",StringType(),True),
          StructField("correction_ind",StringType(),True),
          StructField("current_train_id",StringType(),True),
          StructField("delay_monitoring_point",StringType(),True),
          StructField("direction_ind",StringType(),True),
          StructField("division_code",StringType(),True),
          StructField("event_source",StringType(),True),
          StructField("event_type",StringType(),True),
          StructField("gbtt_timestamp",StringType(),True),
          StructField("line_ind",StringType(),True),
          StructField("loc_stanox",StringType(),True),
          StructField("next_report_run_time",StringType(),True),
          StructField("next_report_stanox",StringType(),True),
          StructField("offroute_ind",StringType(),True),
          StructField("original_loc_stanox",StringType(),True),
          StructField("original_loc_timestamp",StringType(),True),
          StructField("planned_event_type",StringType(),True),
          StructField("planned_timestamp",StringType(),True),
          StructField("platform",StringType(),True),
          StructField("reporting_stanox",StringType(),True),
          StructField("route",StringType(),True),
          StructField("timetable_variation",StringType(),True),
          StructField("toc_id",StringType(),True),
          StructField("train_file_address",StringType(),True),
          StructField("train_id",StringType(),True),
          StructField("train_service_code",StringType(),True),
          StructField("train_terminated",StringType(),True),
          StructField("variation_status",StringType(),True)
      ]))
  ])
)

As can be seen in the code above, schema is a array of structures each element comprising of 2 nested field structures, one called 'header' and another called 'body'. As discussed earlier although we schematise and map the entire message structure, only a small number of fields from the body is used for our analysis. With this schema in place we can actually convert the JSON messages into usable dataframes and run the required analysis in the next cell as shown below:

dfsj = dfstream.select(F.from_json(F.col("body").cast("string"), sch).alias("payload"))
dfsjrows = dfsj.select(F.explode(dfsj.payload))
dfevtoc = dfsjrows.join(dfref, dfsjrows.col.body.toc_id == dfref.Sectorcode)\
  .withColumn("event_time", F.to_timestamp(F.from_unixtime((dfsjrows.col.body.actual_timestamp.cast("double"))/1000)))
dftocperf = dfevtoc.withWatermark("event_time", "30 minutes")\
  .groupBy(F.window(F.col("event_time"), "10 minutes", "5 minutes"), "Description",dfevtoc.col.body.variation_status).count()\
  .orderBy("Description","window")

Note that the messages arriving from the event hub are in binary form and needs to the converted to string to reveal the JSON contained within. Only then can the schema defined earlier could be applied to the JSON to convert them to a dataframe. These steps can all be easily done in a single statement as seen in line 1.

Since each JSON message is an array nested elements, they need to be exploded into individual train movement records. This steps happens in line 2.

In line 3, we join the explored train movement records with the TOC code reference data to being in the TOC names. The statement in line 3 is chained to derive a new column that is a conversion of the actual timestamp from unix time to regular datetime. Note that this is required due to the fact that all time fields from the Network Rail data feed arrives in unix time.

Finally in line 5 we create the analysis dataframe by calculating the count of train movement events grouped by a time window, TOC name and the variation status. Notice that the count is based on a 10 minute rolling time window with a the count being updated every 5 minutes. An optional watermark time of 30 minutes is also applied to prevent a memory overrun as this demonstration notebook is only meant to be run on limited resources. The resulting dataframe also gets ordered by TOC name and the time window. Note that the grouping by time window is not of any value for the analysis, but its only present here to demonstrate the frequency of the real time data as they arrive into the streaming dataframe.

To learn more about aggregations on streaming dataframes along with time windows and watermarks, refer to the documentation at Databricks in following link:

https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html

As long as the client app is sending data from real time data feed to the event hub and the event hub is streaming to Databricks ,the final dataframe will continue to aggregate as per specified fields and time windows. The dataframe can be used in many different ways depending on the requirement. i.e. It could be written to a database for further processing, It could be written back to another event hub for further onward streaming and directly consumed by a dashboard on Power BI or any other thin client or rich client app.

However on this notebook, for demonstration purposes, we can simply display the final dataframe using the Databricks display statement as shown below:

As mentioned earlier, the window column has been included in the aggregation here to demonstrate the frequency of the rolling time window within which the train movement events are being counted. The description column referees to the TOC name which has been referenced from the static reference file containing TOC codes and names. The variation status column is displayed by its fully qualified dataframe path.

The final analysis could of course be further refined, although my objective here was to document and demonstrate the work involved in building a basic data streaming service over this tech stack.

Whilst most application development environments utilise local databases running on individual developer machines, some projects such as ETL and data warehousing may require to work with shared databases where the databases run on a shared server and used on by many developers working on multiple projects.

In these circumstances the developers have to work very carefully with close co-ordination to ensure that they don't inadvertently undo each others work. If one developer has to work in isolation often they may make a temporary copy of the database on the shared server and work on that one separately. This can lead to a lot of unwanted branch databases lying around the server which are highly out of sync with the trunk version.

Now that Microsoft has made the docker images available for SQL Server 2016 and 2017 developer edition, there is a much better solution for managing shared database development.

I am documenting here the process I went though to set up such an environment with a set of sample databases. For simplicity sake, I have only created very basic database structure and not used any complex models.

The Docker engine for Windows can be run on Windows Server 2016 native containers and with Hyper-V enabled on Windows 10 Pro or Enterprise editions.

For the purpose of this documentation work, I am running Docker on my Windows 10 Pro machine with Hyper V enabled and my development environment is based on SQL Server 2016 Developer Edition.

So first I downloaded the Docker Image for SQL Server 2016 Developer from Docker Hub as follows:

docker pull microsoft/mssql-server-windows-developer:2016-sp1-windowsservercore-10.0.14393.693

The database development environment consist of a database that represent the Trunk branch (the production version) of the sample database application which I call APP1DB_Trunk.

In order to create a Docker image for this database I took copies of the .MDF and .LDF files of the existing APP1DB database. These files I stored locally in a folder named D:\PalProjects\Docker\sqldb1. Then in the same folder I created a text file named Dockerfile where the docker image instructions are placed as follows:

# using 2016-sp1-windowsservercore-10.0.14393.693 image
FROM microsoft/mssql-server-windows-developer:2016-sp1-windowsservercore-10.0.14393.693
 
# create directory within SQL container for database files
RUN powershell -Command mkdir C:\\SQLServer
 
#copy the database files from host to container
COPY APP1DB.mdf C:\\SQLServer
COPY APP1DB_log.ldf C:\\SQLServer
 
# set environment variables
ENV sa_password=Dock!4!SQL
 
ENV ACCEPT_EULA=Y

The FROM command instructs docker to use the SQL Server 2016 image which includes its dependant Windows Server Core. Then it runs a PowerShell command to create a new folder named SQLServer on the container's C: drive to which it copies the .MDF and .LDF database files from the local folder. It then sets two environment variables for the SA password and end user licence acceptance indicator. At present I have only intend to use SQL Authentication with SA user to connect to the SQL Server container.

Next, I change directory (CD) to the folder where I have placed the Dockerfile and the database files and ran the following to create the docker image of the database Trunk branch.

docker build -t clt1/app1db_trunk:1.1-2016 .

-t parameter indicates the tag name given to the image. I have used a naming convention as follows:

<client id>/<database name>_<branch>:<database version>-<sql server version>

As such my client ID is 1, database name is APP1DB, branch is Trunk, database version is 1.1 and SQL Server version is 2016. This is the naming convention I have used in this sample, but you may follow your own convention. Also note that docker image names must be all in lower case.

When this is completes, I check the docker image list with the following command:

docker image ls

This lists all the images present in my repository as depicted below:

Now when this image is launched into a container, the instructions in the Dockerfile gets processed and the database files gets copied over from the current host directory (C:\PalProjects\Docker\sqldb1) into the C:\SQLServer directory which gets created inside the container via the PowerShell mkdir command. When the container starts, an instance of the SQL Server 2016 should be running inside the container. However, the Dockerfile does not have any instructions to actually attach the .mdf and .ldf files to a database. This will be given as a parameter in the following command which launches the container from the image:

docker run --name APP1DB_dev -d -p 1433:1433 -e attach_dbs="[{'dbName':'APP1DB','dbFiles':['C:\\SQLServer\\APP1DB.mdf','C:\\SQLServer\\APP1DB_log.ldf']}]" clt1/app1db_trunk:1.1-2016

This command launches the container from the clt1/app1db_trunk:1.1-2016 image given in the last parameter above. The --name parameter assigns the name APP1DB_dev to the container. -d launches the container in detached mode. -p maps a port number from the image host (left side of the colon) to the image container (right side of the colon). As mentioned earlier, the Dockerfile did not have any instructions for attaching the .mdf and .ldf database files to create a database within the container. This is achieved by an -e parameter which sets the environment variable attach_dbs. The parameter value is a JSON string containing key/value pairs for the database name (dbName), and the database file and log file name (dbFiles) in a two element array.

Any number of containers can now be launched from the single image. In addition to the container named APP1DB_dev which I launched for development, I had also launched a second container for testing with the same command only with the --name parameter changed to APP1DB_uat.

To check what containers are currently runinng, issue the following command:

docker ps

The currently running containers are listed as depicted below:

The containers configuration can be examined by using the docker inspect commandL

docker inspect APP1DB_dev

This will show the entire configuration as a JSON object. However for the purpose of being able to connect to the SQL Server instance running inside the container, I only want to know the IP Address that docker has assigned to the container. The following parameter can be used to filter the specific JSON item within the configuration:

docker inspect APP1DB_dev -f "{{.NetworkSettings.Networks.nat.IPAddress}}"

The above parameter will show the IP Address docker has assigned to the container.

Note:- An important point to note is that the IP Address displayed above is only visible within the host environment. This means it cannot be accessed from outside the host from the LAN. However it is possible to configure the containers network to allow docker to use the LAN's IP address range when assigning an address to the container.

For this documentation, I have not gone into this network configuration and had limited the activity to the host environment only. Visit the Docker reference documentation website at https://docs.docker.com/engine/reference/run/ for detailed information on this topic. I shall update this document at a later date with how to configure container network to use the LAN's IP address range.

So with that said, to connect to the SQL Server instance within the above containers, launch SQL Server Management Studio and connect as follows :

After successfully connecting, it will be possible to see the database that was attached within the Object Explorer :

Having connected to the database, I then added some data to TABLE2 like so:

Now I kept the best part of docker containers to the last. As I said at the beginning, when developing and testing in a shared database environment, there is often a need to have the database in a specific state to conduct a specific test. Perhaps to run in lengthy ETL flow that could take several hours or days. This can be very tricky to co-ordinate with other developers and testers. You can always launch a new container from the trunk image, but that defeats the point of being able to have the database in a state (mostly due to loosing the test data) appropriate for the test.

So before being able to commit a running container, it must be stopped. Now it appears that this is only a limitation with Windows and does not apply to Unix platforms. Hopefully this is something that would be addressed in the future. To stop the running container issue the following command:

docker stop APP1DB_dev

With the container stopped, its current state can be committed to a new image with the following command:

docker commit APP1DB_dev clt1/app1db_dev_test2:1.2-2016

This creates a new image by the name clt1/app1db_dev_test2.1.2-2016. As per my naming convention this referrers to a development branch of the APP1DB database used specifically for the test2. Since I could have also added any new objects such as tables, views stored procedures, I have increased the version to 1.2. The -2016 still referrers to the SQL Server 2016 version.

The new image can be seen in the repository with the 'docker image ls' command as used previously. This image now contains all the changes that was made so far in the development branch under the APP1DB_dev container. Now a new container can be launched from the newly committed image with the following command:

docker run --APP1DB_dev_test2 -d clt1/app1db_dev_test2:1.2-2016

This launched the new container with the name APP1DB_dev_test2. I then examine its IP address in order to connect to the SQL Server instance as follows:

I can now connect to this IP Address with SQL Server Management Studio and run the required tests in isolation. The following shows the new rows that I added to the development branch container before committing its image.

Those rows can now be seen in the new container and I can now make any changes to this container APP1DB_dev_test2, without effecting the other development branch container APP1DB_dev.