So now we can build this minimal SSIS package, to first stage and then load the contacts.txt file into the SQL database.
Open the LoadContacts.dtsx package in Visual Studio and add the 2 following package level variables:
pkDBConnectionString :- This string type variable will hold the OLEDB connection string to the detination SQL database.
pkSourceFilePath :- This string type variable will hold the file system path to the location where the contacts.txt file will be delivered.
The above code fetches the staged data from the staging.Contacts table with an additional field named rno that indicates a row number based on the most recent AddressDate column for each contact number represented by the Number column. This field is used later on down the pipeline for the purposes of de-duplication.
Next we add a Conditional Split component which uses a expression rno == 1 to check for de-duplicated records. All records with rno > 1 are considered duplicates and are ignored from the pipeline.
Add a Lookup component which will be used to check for the existence of contacts in the destination database. Set the component to use the cnOLEDB_ContactsDB connection manager and code the following SQL command to fetch the existing contact records from the database into the lookup.
Map the stg_Number from the source data to the id_Number field in the database table and select all other fields to be returned to the pipeline when a record is found.
Add another Conditional Split component and connect the 'Match' branch from the Lookup component to this split component. Here we will do a further check using an expression to see if any of the fields have changed between the staged source and destination records. Insert the following expression into the split condition:
Bellow the second split component add a OLEDB Command component and connect the split component to it. Set the cnOLEDB_ContactsDB connection manager to it and add the following SQL command.
The above command will update the matched record where a change has taken place. As can be seen the above command takes in the values for each of the fields as parameters. These parameters gets default names as param_0, param_1, param_2, param_3, param_4 in the order they appear in the command. Map each of these parameters to the staged source data fields as follows:
Please note that the above technique is not the most efficient way of updating records in a production environment, especially if the data set is large. However for the sake of simplicity we will use this method for this sample application.
For the 'No Match' branch of the lookup component, we will add a OLEDB Destination component and connect that branch to it. All new contact records with no matching numbers will travel via this branch to get inserted into the destination table. Set the cnOLEDB_ContactsDB connection manager to the destination component and select the destination table dim_Contacts from the drop down. Map all the fields from the sourced staging data set to the destination table as follows. Ignoring the key_Contact field since this is an incremental auto-identity field.
Now that the SSIS package has been built, we will see how to develop some basic unit tests that will cover the most important business rules of this ETL process.
As mentioned earlier we will be using the SSISUnit .Net API with C# via the Visual Studio's own built-in unit testing framework. The main benefit you get with this method over the others, is that, you can now perform the unit tests as part of the solution build process. This can also be easily incorporated into a continuous integration process that you may be carrying out on regular basis. All the code relating to the application and tests are all located together within the same solution and saves you the trouble of maintaining any additional testing scripts in other locations.
SSISUnit offers the ability perform unit tests at individual task level or entire package level. At which level you conduct your tests depend entirely on the complexity of the application. For our sample project, we will be conducting one test at the task level and others at the package level, so you get experience both approaches.
All tests will follow the traditional xUnit structure of Setup, Test and Tear-down.
In this test we will test the individual data flow task within the package that loads the data from the source file to the staging.Contacts table. For this we will use the following data set in a tab delimited text file. We will save this file as contacts_batch1.txt inside a sub folder named SourceFiles within the unit test project folder. As we shall see later, the test setup will take a copy of this file into the contacts.txt file which is used by the package.
Number | Name | Telephone | Address | AddressDate |
1 | James | 787,387,423,568,938,000,000,000,000,000,000 | 24 Harper Ave, Middlesex, MH2 4BU | 22/04/2003 |
1 | James | 787,387,423,568,938,000,000,000,000,000,000 | 154 Arvin Drive, Middlesex, MH5 7GD | 14/12/2005 |
2 | Charls | 7,656,928,671,776,520,000,000 | 1A Park Street, Bedford, BD7 9OP | 16/01/1987 |
3 | Carol | | 30 Church Road, Brighton, Essex, BR8 7UT | 02/09/2000 |
4 | Sarah | 9,897,876,521,276,760,000,000 | 38 Kings Lane, Wimbledon, London, SW7 5TR | 07/12/2006 |
4 | Sarah | 9,897,876,521,276,760,000,000 | 38 Kings Lane, Wimbledon, London, SW7 5TR | 07/12/2006 |
Now we need to create a test definition for this in xml. To do this open the UnitTest_LoadContacts.ssisUnit file that we added to the unit test project earlier and insert the following xml code.
If you take a closer look at the above xml, you will see some basic elements that are organized as follows.
<TestSuite>
<!-- A Test Suite is the main container for the entire test definition. A Test Suite contain one or more
individual unit tests -->
<ConnectionList>
<!-- A Connection List contain one or more connection string to specific databases. These
connections are then referenced from elsewhere in the definition to issue SQL commands to
those databases --></ConnectionList>
<PackageList>
<!-- A Package List contain one or more fully qualified package paths pointing to individual SSIS
packages. These are then referenced from elsewhere in the definition --> </PackageList>
<TestSuiteSetup>
<!-- Any setup that is required and common to all tests within the test suite are defined within this
element. Setup can contain multiple commands, which can perform different types of
operations. The supported command types are SQLCommand, ProcessCommand and
VariableCommand. Note that this setup only runs once for the entire test suite-->
</TestSuiteSetup>
<Setup>
<!-- Similar to the previous element, this element also may contain multiple commands. However
unlike the previous setup this setup runs once for each individual test defined within the test
suite. Hence these commands are intended to perform any common operations that are
needed prior to running individual tests. --></Setup>
<Tests>
<!-- This element is acts as the container that holds one or more individual tests to be defines for
the test suite. -->
<Test>
<!-- This is the container element for an individual test. Each test will follow the Setup, Test,
Tear-down structure. -->
<TestSetup>
<!-- Within this element, any setup needed for this specific setup will be defined. Similar to
the other setup's, the command types used here can be SQLCommand,
ProcessCommand or VariableCommand. --></TestSetup>
<Assert>
<!-- This is the container element for the assertion definition of this individual test. Here the
expected results and the type of this assertion is established. Inside this element there can
be one more more commands defined that checks for the actual results of the test. These
command are of type SQLCommand, ProcessCommand and VariableCommand. -->
</Assert>
<TestTearDown>
<!-- Within this element there can be multiple commands that perform any operations needed
to clean up or reinstate data after the test has completed --></TestTearDown>
</Test>
</Tests>
<TearDown>
<!-- This element may contain one or more commands that perform any common operations
needed to clean up or reinstate data after each individual test within the test suite has
completed. --></TearDown>
<TestSuiteTearDown>
<!-- This element may contain one or more command that perform any operations needed to clean
up or reinstate data after the entire test suite has completed. Unlike the previous tear down
element, the command contained within this element will only run once for the entire test
suite. --></TestSuiteTearDown>
</TestSuite>
Given the above information, lets take some time to examine the xml code for the test definition. We start by looking at the <Test> element which declares the unit test. We give its name attribute to be "TestBatch1Staging". The package attribute carries a reference to the SSIS package listed in the <PackageList> element at the top of the definition file. The task attribute specifies which specific task within the to execute to conduct the test. In this case we are testing the staging Data Flow named as cfDFT_StageContacts (this is a naming convention that I follow). In case we were testing the entire package, then the task attribute will carry the name of the package itself. The taskResult attribute indicates the expected result of the task or package and not the expected result of the unit test itself.
Next we look at the <TestSetup> element. This is where the commands that perform the test setup is specified. For this test, we have used three types of commands.
First we have a VariableCommand, which can perform Set and Get operations on SSIS package variables. This is a very useful feature which allows you to inject different values into the package for different test conditions or environments. Here we use it to set the file path to the contacts.txt source data file in the pkSourceFilePath variable. We also set the connection string for the destination database in the pkDBConnectionString variable.
Then we use a FileCommand, which can perform basic system file operations. We use this here to copy our Batch1 data set stored in the file contacts_batch1.txt to the contacts.txt file. This method gives us the flexibility to inject different data sets for different tests
Another SQLCommand is used to clear the staging table before running the test. This is performed by issuing the TRUNCATE TABLE staging.Contacts command on the database referenced by the connectionRef attribute.
Please note that all the folder paths, database names and connection strings specified in the code are referring to my own development environment. So you will need to replace them with the appropriate values for your environment.
Now look at the <Assert> element. Here we define its name as "AssertStagingContacts". The expectedResult attribute is "6" which is the number of records in the contacts_batch1.txt file and what we also expect to find in the staging table after the staging data flow has run. testBefore attribute is to define whether the assertion should run before the task has run. In this case we want it to run after the task has run, so we set it to false. The expression attribute is a boolean to indicate whether the assertion is an expression which in this case is not, so we set it to false.
Now we set the assertion command. You can use a number of different types of commands. For this test we use a SQLCommand named as "CmdCountStagingContacts" its connectionRef is set to "ContactsDB" which refers to the connection listed under the <ConnectionList> element at the very top of the test suite definition. The returnsValue attribute is set to "true" to indicates that the SQLCommand issued here is a simple SELECT COUNT(*) FROM staging.Contacts which returns a scalar value that will be compared to the value set in the expectedResult attribute to determine the passing or failure of the test. So if the SQL command returns a record count of 6, the test will pass, otherwise it will fail.
That concludes the definition of the test suite and the individual test within it.
Next, we need write the unit test code in C# using the SSISUnit .Net API to execute the test. This is done in the UnitTests_LoadContacts.cs file we added earlier to the project. So lets examine the code to execute the above test.
As can be seen above, the SSISUnit .Net API has been referenced by using the SsisUnit and SsisUnitBase namespaces. In addition a reference to System.xml namespace is also added to allow SSISUnit to serialize and de-serialize xml.
When you create a Unit Test Project in Visual Studios, it adds a class with the attribute [TestClass] with a public Initialize method with the attribute [TestInitialize]. The Initialize method is intended to be used for the purposes of setting configurations and allocating resources needed by all the tests in the class. In SSISUnit this initialization work will be carried out by the API based on the definitions specified within the Setup elements and so we would not need to directly code anything in the Initialize method.
Within the test class, we will create the following private member variables which will be used by all the tests.
private SsisTestSuite testSuite; /* This will hold a reference to the test suite object used by all tests */
private TestResult testResult; /* This will hold a reference to the test result object used by all tests */
private Test test; /* This will hold a reference to an individual test object */
private Context context; /* This holds a reference to a context object that is used by the test */
private bool isTestPassed; /* This is a boolean that indicates if the test has passed */
SSISUnit delegates several events that gets fired at various stages in the test execution cycle. The one that we are interested in for this project is the AssertCompleted event. This gets fired when SSISUnit has completed an Assert operation of a test. We will subscribe the following public method as the event handler to this event to handle the assertions for all the tests.
public void TestSuiteAssertCompleted(object sender, AssertCompletedEventArgs e)
{
if (e.AssertName != null)
{
testResult = e.TestExecResult;
isTestPassed = e.TestExecResult.TestPassed;
}
}
The e parameter is an object that contains all the information pertaining to the assertion. In the most simplistic use of this method we simply check the e.TestExecResult.TestPassed property indicates whether the test passed or not and assign its boolean value to the isTestPassed member variable.
We now need to add a method to conduct the test. This should be a public method with the attribute [TestMethod] added at the top. Its best to give this method the same name that was used for its definition in the xml file which in this case is TestBatch1Staging as shown bellow.
[TestMethod]
public void TestBatch1Staging()
{
testSuite = new SsisTestSuite(@"<path to the xml file>\UnitTest_LoadContacts.ssisUnit");
test = testSuite.Tests["TestBatch1Staging"];
testSuite.AssertCompleted += TestSuiteAssertCompleted;
isTestPassed = false;
context = testSuite.CreateContext();
bool rs = test.Execute(context);
Assert.AreEqual<bool>(true, isTestPassed);
testSuite.AssertCompleted -= TestSuiteAssertCompleted;
}
The first line creates a new SsisTestSuite object by passing in the test definition xml file as a parameter. The resulting object will contain all the information about the test suite.
Next we access the specific test from the Tests collection property by specifying the test name as an index.
Then we subscribe our TestSuiteAssertCompleted method to the AssertCompleted event. The isTestPassed member variable is set to false before running the test. This is so that the default result of the test will always be false until the test result is assigned to it.
A new SSISUnit test context object is created which contains any references to system file paths and database connections.
Then we execute the test by calling test.Execute method passing to it the context as a parameter. This is the method that actually runs the SSIS package or task that was specified in the test definition. The method returns a boolean that indicates whether the package or task ran successfully. Note that this is not an indication about the pass or failure of the unit test itself. That happens next with the assertion.
Now we do the assertion with the Assert.AreEqual method which is a method of the Visual Studio unit test framework. This method takes in 2 boolean values as parameters, the first being the expected value and the second being the actual value which in this case is the value held in the isTestPassed member variable. If the two values match then the test is considered passed otherwise its a fail.
Finally we unsubscribe our assertion event handler so that another test can subscribe to it.
Test 2 :- Updating contacts with de-duplication
In this unit test we want to test the entire package execution to see if it correctly updates existing data by first de-duplicating any duplicate contact records that may exist in the source data file. In order to achieve this we will use a second batch of data in the file contacts_batch2.txt as given bellow:
Number | Name | Telephone | Address | AddressDate |
5 | Paul | 676,256,354,316,735,000,000,000,000,000,000 | 201A Jasper Street, Croydon, London, CR4 9KT | 13/06/2003 |
2 | Charls | 765,692,867,177,652,000,000,000,000,000,000 | 57 Tinpan Alley, Bedford, BD4 3GH | 10/10/2000 |
2 | Charls | 765,692,867,177,652,000,000,000,000,000,000 | 78 Station Road, Bedford, BD3 6TU | 15/04/2001 |
As you can see in the above data set contact number 2 has duplicate records. As stated earlier in the Application Scenario, when inserting new records and updating old records, the ETL process should only pick the record with the most recent date which in this case should be the third record with the AddressDate of 15/04/2001. We will define this unit test to assert both the updating of an existing record and the correct de-duplication by issuing a SQL statement that checks for this specific record with these specific values.
Now lets define this test by coding the following xml into the UnitTest_LoacContacts.ssisUnit file.
Note that the above code should be inserted between the <Tests> and </Tests> elements. The test has been given a name as "TestBatch2OldDedupContact". This is the naming convention I follow, however you can choose any meaning name and also include spaces to make it more readable. You will notice that unlike the previous test Test 1, here the package and task attributes carries the same value which is the name of the package. This is because for this test the unit we are testing is the whole package. The <TestSetup> element is similar to Test 1 except that now the sourcePath attribute of the FileCommand elements refers to the contacts_batch2.txt file which will be copied overwriting the existing contacts.txt file if it already exist. Also notice that the final SQLCommand which truncates the staging table has been removed. The reason behind this would become more apparent when we examine the .Net C# code that will drives this test.
Look at the the <Assert> element. We have set its name attribute to be "AssertOldDedupContact". Again you may chose a more meaningful name if you wish, as this naming convention is just the one I have been using. The expectedResult is set to "1". The reason for this expected value will become apparent when we see the assertion SQL command. The other attributes for this element remain the same as the previous test.
For the assertion command we again use a SQLCommand with a the name attribute set as "CmdGetContactValues", and the rest of the attributes set exactly the same as the previous test. The SQL command that is issued to assert the test is as follows:
SELECT
CASE WHEN
Name = 'Charls' AND
Telephone = '76569286717,76520989012,87293752635' AND
Address = '78 Station Road, Bedford, BD3 6TU' AND
AddressDate = '2001-04-15'
THEN 1 ELSE 0 END
FROM dim_Contacts
WHERE id_Number=2
This command simply filters contact number 2 and checks for specific field values that we know are the correct values for the test to pass. If they all match then the CASE statement will return a 1, otherwise a 0. When the return value is compared to the value set in the expectedResult attribute and if they are both 1, then the test will pass, otherwise it will fail.
Now lets look at the .Net code in C# in the unit test project and see how this test is conducted via the SSISUnit .Net API. To do this we add a new method to the UnitTest_LoadContacts.cs class file in Visual Studio as follows:
As in the previous test we give the method same as given to the unit test definition. As the code for this test is very similar to the previous test, you can refer to the explanation given earlier for an understanding of the API. Where the code for this test differs from the previous one is that here we are executing two tests. The first one "TestBatch1NewContacts" is executed first, but no assertion is done. This is a separate test definition created to test for new records from the contacts_batch1.txt file being correctly inserted. Here we are hijacking that test in order to initialize our destination dim_Contacts table with data that we can then update via the second test and perform the required test assertion.
So as you can see in the above code only one Assert.AreEqual method is called to assert the second test.
Other tests
Apart from the two tests mentioned above, I have created a few more tests that tests several application scenarios all of which are too much to document in this article. However you can download the entire sample with all the tests from my
GitHub repository.
Running the test in Visual Studio Test Explorer
Before being able to run the tests, the unit test project needs to be built. When built all the methods with the attribute [TestMethod] attached to them will appear in the Test Explorer. You will then have a number of options on how to run them. You can also set all the tests to be run automatically when ever the solution is built which maybe useful for Continuous Integration builds. For more information about how to use the Visual Studio Test Explorer please refer to this
MSDN link.
To view the Test Explorer, in Visual Studio simple select the Test menu, then Windows, the Test Explorer. You should see a window similar to the one bellow:
To run an individual test simply right click on a test and select 'Run Selected Test'. The test will execute and if passed will be indicated with a greed tick. If failed it will be indicated with a red exclamation mark.