Apr 7, 2010

Creating your own ETL framework

Now where did that vacation time go? I was supposed to get some coding done, but the time flew faster then anticipated so here's a little bit of background instead.

-==- What's ETL? -==-

ETL stands for Extract, Transform, Load. ETL is usually used to load data from one (or multiple) source(s), like a CSV-files, perform some transformations to fit it to another data storage, clean it and load it in, for example into a database. I recommend reading the Wiki article first, if you're not familiar with the concept.

-==- What do we need for the framework? -==-

Basic ETL operation consists of at least one input and one output and may contain transformations. More complex ETL operations can of course contain multiple inputs and ouputs and definitely have multiple transformations.

I've used SQL Server Integration Services (SSIS) for several years and my views on ETL are mostly based on that. I'm no expert, but I hope my rant will help someone else or if not, hopefully you'll be confused enough to google more resources :)

I've used ETL for Data warehousing and general data storage. Most of the pain points using SSIS have been it's testability and logic operations (if/then/else), which is why I wanted to create custom framework.

-==- So what do we want and how do we achieve it? -==-
  1. Testability
  2. Performance
  3. Extendability
  4. Easy to use and deploy
 * Testability *

Testability was really my main reason for doing this. Minor changes in larger ETL chains can cause large problems if done incorrectly. Testing the whole logic of the ETL chain each time manually will take a long time and can leave a lot to be desired. And everytime you find bugs in production, means you have most likely broken lots of data and, if you're lucky, you have to fix the bug, restore to a previous backup and re-run all the data again. If you're unlucky, you find it out too late or the data is not reproducible, so you have to manually clean up what you can and delete the rest.

- Testing components and data flows -

Testing the transformations are easy. You can always input test data and verify the output. (See source code for my component tests.)

Inputs aren't usually the problem, but they should be interchangeable within a category (file vs. database). Then we can for example swich SQL Server to SQLite during testing or file for an in memory stream.

Outputs are bit more complex, as the result of the output can be inserts (file and database) and updates (database) to the base storage. Of course, this is only the basic usage; output's could be for all I know web service calls. My output testing strategy is still on the drawing board and hopefully it'll be more concrete after few more iterations.

 - Testing the whole -

As you can see, single component and single "data flow" (starts with inputs, ends with outputs) testing is usually straightforward. Usually though we have multiple flows within a single ETL whole. In SSIS the whole is split into three parts: packages that contain control flows (loops, if/else logic, creating files etc) and data flows (inputs --> transformations --> outputs). Packages can be run individually or from SQL Server job in batches. Executing the package runs the contained control flow and all the data flows that according to logic should be executed.

To do full "beginning to end" scenario testing, we'll need some persistant lightweight database, so we can really operate on the data. I'm thinking of using SQLite as a persistant (during testing scenario) storage. Of course, using a local instance of MSSQL or any other database will suit the purpose, but using an easily cleanable database is a must.

To easily replace real inputs and ouputs, data flows need to support Input/output injection.

* Performance *

Usually the performance bottlenecks are inputs and outputs, specifically, databases. Not much we can do on this front, but we stand on the same line as everyone else on this.

* Extendability *

Creating new components should be easy enough. I'm using Reactive Extensions as the basis of my data flow and churned out a dozen components in couple of hours.

* Easy to use and deploy *

I'm still working (well, thinking) on both of these. Having components, that have multiple inputs and outputs, are hard to abstract away to the same level as everything else. I'm taking all suggestions on how to better create Dataflow components (see below for a sample).

Deployment is still on the drawing board. I'm thinking of using couple different strategies:

1) Using an commandline executable to execute ETL-containers
2) Using a windows service to execute ETL-containers on schedule

Both of these choises are probably going to use MEF (Managed Extensibility Framework) to find deployed ETL-containers and run them.
   
-==- Sample usage -==-

public class PersonDataflow : Dataflow
{
  public PersonDataflow()
  {
    // we're reading a CSV file named persons.txt
    var input = new CSVInput<Person>(@"persons.txt");
    RegisterInput(input);

    var dc = new DerivedColumnTransformation<Person>(
    input.OutputStream, a => a.Firstname = a.Firstname ?? "No firstname");

    var output = new CSVOutput<Person>(dc.OutputStream, @"persons_transformed.txt");
    RegisterOutput(output);
  }
}

[DelimitedRecord(";")]
public class Person
{
  public string Firstname { get; set; }
  public string Lastname { get; set; }
  public string StreetAddress { get; set; }
  public string ZipCode { get; set; }
  public string PostOffice { get; set; }
}

static void Main(string[] args)
{
  var pdf = new PersonDataflow();
  var pdfTask = pdf.Execute();
  pdfTask.Wait();

  System.Console.WriteLine("Done.");
  System.Console.ReadKey();
}

CSVInput and CSVOutput components are based on FileHelpers CSV library.
They will create an Person objects for each row and vice versa.

DerivedColumnTransformation is a simple transformation that will replace null Firstnames with the text "No firstname".

Executing the task will create a Task (from Parallel Task Library), meaning the data flow is executing on a separate thread. Then we can either execute more DFs or just wait for it to complete.
These input/output components operate on staticly typed classes (Person), but I'm thinking how to move my couple of previous prototypes on how to query CSV and SQL dynamicly into this.

-==- Other frameworks -==-

* SQL Server Integration Services *

- Advantages -
  • Graphical user interface
  • Easy to use for simple ETL scenarios without any programming knowledge
  • Integrates with SSAS (SQL Server Analysis services)
  • Easy to use and deploy (moving components between locations is somewhat complex)
  • Many automaticly working components like logging
  • .. and much more.
- Disadvantages -

  • Hard to unit test
  • More complex solutions require creating C#/VB "script componets" to handle string manipulations, logic etc.
  • Logic operations make the flow overly complex and hard to manage
So what I'm replacing with my framework idea is a nice user interface for everyone with testable, easy to use framework for programmers. This of course won't work on most cases where ETL is done by non-programmers, but if you're doing ETL and you are proficient enough in .Net then I hope it'll be useful to you.

* Rhino.ETL *

I haven't used Rhino.ETL myself, but it seems to be aiming at the same direction then I am, though a bit differently. If you're interested in programmatic ETL, I'd suggest taking a look at it.

http://ayende.com/Blog/archive/2008/01/16/Rhino-ETL-2.0.aspx

0 comments:

Post a Comment