Feb 3, 2010

Data profiling in C# with SSIS

This is a simple wrapper for SSIS Data Profiling Task. The intention is to use profiling from a .Net library without using (Business Intelligence Development Studio).

Profiling task is used to compute data profiles according to different rulesets. Rulesets available in 2008 Integration Services are:
* Candidate key profile -- for finding natural keys in the data
* Column length distribution profile -- For checking data lengths
* Column null ratio profile -- For checking how many null values each field has.
* Column pattern profile --
     - For generating patterns from data. For example, if column has always 4 successive numbers the result is \d{4}. The results are regular expressions, that you can use elsewhere to check if the data is similar to what's in the column.
* Column statistics profile -- For statistics such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns.
* Column value distribution profile -- Reports all the distinct values in the selected column and the percentage of rows in the table that each value represents.
* Functional dependency profile -- A Functional Dependency profile reports the extent to which the values in one column depend on the values in another column or set of columns.
* Value inclusion profile -- Value Inclusion profile computes the overlap in the values between two columns or sets of columns.

You can find more information on what the profiler actually does from Jamie Thomson's blog. Unfortunately it has moved without the old posts, but you can still find (at least some of them) in archive.org .

Other resources:
http://dougbert.com/blogs/dougbert/archive/2009/01/20/learn-more-about-data-profiling-in-sql-2008-a-compilation.aspx

-==- Database -==-

What would data profiling be without data? You can download Microsoft's sample database Adventure Works 2008 from here.

Running the profiler requires that Integration services is installed on the machine and that requires Standard or higher SQL Server. Developer edition will do fine for this; if you're a student, you can get it from https://www.dreamspark.com/ and if you have a MSDN subscription, you probably have a downloadable version of developer edition from there.

-==- Accessing SSIS package programmatically -==-

Accessing SSIS runtime and needed classes requires a few assemblies. These come with the SQL Server sdk that's included in the SQL Server installation package. (And they are also included in attached solution, so you don't have to install the SDK if you already have 2008 integration services installed.)

Microsoft.SqlServer.DataProfilingTask.dll -- The task itself
Microsoft.SqlServer.DataProfiler.dll -- Profiling rulesets and their support classes
Microsoft.SQLServer.ManagedDTS.dll -- Access to SSIS runtime

When you've installed the sdk, you can find the assemblies in:
C:\Program Files\Microsoft SQL Server\100\SDK\Assemblies\

I've cheated a little and created a package that has a single task that is the Data Profiling Task. You can also create it and the DataProfilignTask programmatically, but I'm skipping that here.
I've also added three variables to the package, so we can easily change connection strings and output location.

// initialize SSIS application runtime
var app = new Microsoft.SqlServer.Dts.Runtime.Application();

// load the package
var pkg = app.LoadPackage(@"package.dtsx", null);

// the package contains only one task. loading it.
var task = pkg.Executables[0] as TaskHost;
var dpt = task.InnerObject as DataProfilingTask;

// get connections
var connectionDB = pkg.Connections[0];
var connectionTagDB = pkg.Connections[2];

// connection string for SQL Server. assuming that we're using localhost.
pkg.Variables["DBConnectionString"].Value =
    "Data Source=.;Initial Catalog=AdventureWorksDW2008;Integrated Security=SSPI;"
  ;

// connection string for TagDB (see ColumnPatternProfileRequest for more details)
pkg.Variables["TagDBConnectionString"].Value = "";
// output file location. we'll create a temporary file.
pkg.Variables["FileConnectionString"].Value = System.IO.Path.GetTempFileName();

// Now we've setup the package, let's add profilers.

// create new Candidate Key Profile query
var keyprofileRequest = new Microsoft.DataDebugger.DataProfiling.CandidateKeyProfileRequest();
// set the id for request
keyprofileRequest.RequestID = Guid.NewGuid().ToString();
// set the database id
keyprofileRequest.DataSourceID = connectionDB.ID;
// select table
keyprofileRequest.Table = new Microsoft.DataDebugger.DataProfiling.TableQName("DimCurrency");
// add column(s)
keyprofileRequest.KeyColumns.Add(Microsoft.DataDebugger.DataProfiling.ColumnParameter.WildCard);

// set profiling attributes
keyprofileRequest.KeyStrengthThreshold = 0.95;
keyprofileRequest.MaxNumberOfViolations = 100;
keyprofileRequest.ThresholdSetting = Microsoft.DataDebugger.DataProfiling.StrengthThresholdSetting.Specified;

// add the request to request collection so it will be executed.
dpt.ProfileRequests.Add(keyprofileRequest);

// execute package
var result = pkg.Execute();

// if everything goes fine, result should be DTSExecResult.Success
if (result == DTSExecResult.Success)
{
    // print out the result file location
    System.Console.WriteLine(pkg.Variables["FileConnectionString"].Value);
}
else
{
    // Something went wrong, print errors.
    System.Console.WriteLine("Package failed. Errors:");
    foreach (var error in pkg.Errors)
    {
        System.Console.WriteLine("{0}: {1}", error.ErrorCode, error.Description);
    }
}

Now, running the package should create for you a xml-file that you can open with DateProfileViewer.
It's located at C:\Program Files\Microsoft SQL Server\100\DTS\Binn -folder.

As a result from DimCurrency tables column analysis we can see that all the columns contain unique data and can (but not all probably shouldn't) be used as a natural key.


You can use all the other ProfileRequest-classes similarly:
Microsoft.DataDebugger.DataProfiling.CandidateKeyProfileRequest
Microsoft.DataDebugger.DataProfiling.ColumnLengthDistributionProfileRequest
Microsoft.DataDebugger.DataProfiling.ColumnNullRatioProfileRequest
Microsoft.DataDebugger.DataProfiling.ColumnPatternProfileRequest
Microsoft.DataDebugger.DataProfiling.ColumnStatisticsProfileRequest
Microsoft.DataDebugger.DataProfiling.ColumnValueDistributionProfileRequest
Microsoft.DataDebugger.DataProfiling.FunctionalDependencyProfileRequest
Microsoft.DataDebugger.DataProfiling.InclusionProfileRequest

Few notes on the interfaces:
- Some of the ProfileRequests support profiling all the fields. You can use Microsoft.DataDebugger.DataProfiling.ColumnParameter.WildCard for that. Using a single field is new Microsoft.DataDebugger.DataProfiling.ColumnParameter("FieldNameHere"))
- Profile request's DataSourceID is always a reference to database connections ID in the package.

Otherwise, the programming interface closely resembles what you see on Business Intelligence Development Studio.


-==- C# 4.0 / Visul Studio 2010 -==-

Unfortunately, I couldn't get this to work with VS2010. It seems that COM-object handling is somewhat changed or there was an assembly mismatch. But if you don't need static typing, you can always use dynamic instead of casting InnerObject to DataProfilingTask and it should work similarly.

-==- Sources -==-

Here.

0 comments:

Post a Comment