Finally, some experiences on testing. I'm using the sample code from
previous post,
except I modified it a bit to use more current input/output components. Previous version used the old NHibernate versions, but replacing them was pretty straightforward.
Note that all the test cases are more scenario based tests then actual unit tests, as the components are much larger then regularly tested by a single unit test _and_ that they also require an external component (in this case the database).
-==- Testing the dataflow -==-
Primary parts of the dataflow are:
- loading data from CSV
- skipping invalid rows
- inserting loaded date to table
Testing the dataflow is the easy part as it's mostly separated from anything else except the database. Unfortunately, as databases differ, there's no simple way to abstract away using real database,
so we'll assume for now that there's a local version of the database on (each) developer's machine.
For testing, I'm using Visual Studio Unit Testing Framework, eg. the one that comes with any VS version.
Ok, so first, let's see what's common between all the tests and setup them as initialize/cleanup:
private string workingDir = null;
private const string connectionString = @"Data Source=localhost;Initial Catalog=Test;Integrated Security=SSPI;";
private DynamicSQL sql = null;
[TestInitialize]
public void Initialize()
{
workingDir = Path.GetDirectoryName(Assembly.GetAssembly(typeof(DataFlowTests)).Location);
sql = new DynamicSQL(connectionString, "System.Data.SqlClient");
// clean up the table
sql.ExecuteSQL("truncate table comic");
}
[TestCleanup]
public void Cleanup()
{
if (sql != null) sql.Dispose();
}
Here we'll setup the working directory; that is, where we'll be getting our input files from. Also, we'll open a database querying tool which we'll use to verify the input results. Also, to assure that the database will be in correct state before each test, we'll clear up the table between each test case.
Then on to the actual tests.
If you'll remember our input file is not a straight up CSV file, but contains rows that are not part of the data we will import. Therefore, we'll verify that only the data is imported.
I downloaded the latest list from http://previewsworld.com/shipping/archive/2010/051910.txt and took only a part of it for testing. There are 12 correct rows and 12 lines of either empty rows or some text.
I've added the file to my solution with "Copy to output directory" set to "Copy always", so we can access it from build directory (that's why we set working directory to the same directory as the assembly location).
[TestMethod]
public void dataflow_should_skip_rows_that_doesnt_have_3_columns()
{
// simulate datatime deducing
var fileDateTime = new DateTime(2010, 5, 19);
// load data
var df = new ComicLoadDataflow(
Path.Combine(workingDir, "051910_partial.txt"),
fileDateTime,
connectionString);
var task = df.Execute();
task.Wait();
// verify that we loaded only the 12 correct rows
Assert.AreEqual(12, sql.Single("select count(0) Count from comic").Count);
}
After verifying that we actually get rows to the database, we'll verify that the data is inserted into the right columns. For this, I'm using an input file containing only one row.
[TestMethod]
public void verify_that_csv_columns_are_placed_in_right_columns()
{
// simulate datatime deducing
var fileDateTime = new DateTime(2010, 5, 19);
// load data
var df = new ComicLoadDataflow(
Path.Combine(workingDir, "051910_one_row.txt"),
fileDateTime,
connectionString);
var task = df.Execute();
task.Wait();
// load loaded row from database
var row = sql.Single("select * from comic");
// verify columns
Assert.AreEqual<string>("MAR100023", row.DiamondIdentifier);
Assert.AreEqual<string>("AVP THREE WORLD WAR #4 (OF 6)", row.Name);
Assert.AreEqual<string>("$3.50", row.Price);
Assert.AreEqual<DateTime>(fileDateTime, row.ReleaseDate);
}
Then on to the hard part. Our data loading uses SCD so we'll need two files. An original file and updated file that contains new rows and updates to existing rows.
[TestMethod]
public void new_rows_in_csv_should_create_new_rows_in_db_and_changed_rows_should_be_updated()
{
// simulate datatime deducing
var fileDateTime = new DateTime(2010, 5, 19);
// load data
var df = new ComicLoadDataflow(
Path.Combine(workingDir, "051910_scd_part1.txt"),
fileDateTime,
connectionString);
var task = df.Execute();
task.Wait();
Assert.AreEqual(12, sql.Single("select count(0) Count from comic").Count);
// update the datetime to simulate a new file coming the next day.
fileDateTime = fileDateTime.AddDays(1);
// load inserted and updated rows
df = new ComicLoadDataflow(
Path.Combine(workingDir, "051910_scd_part2.txt"),
fileDateTime,
connectionString);
task = df.Execute();
task.Wait();
// verify updates
var changedRow = sql.Single("select * from comic where diamondidentifier = 'DEC090058'");
Assert.IsNotNull(changedRow);
Assert.AreEqual<string>("OH MY GODDESS RTM TP VOL 35", changedRow.Name);
Assert.AreEqual<DateTime>(fileDateTime, changedRow.ReleaseDate);
var insertRows = sql.Single("select count(0) Count from comic where diamondidentifier in ('MAR100302', 'MAR100296', 'JAN100295')");
// verify that we had 3 new rows
Assert.AreEqual(3, insertRows.Count);
}
That's it. If you want, you can make the tests more explicit; I've used only row counts to verify inserts, so they might not be 100% accurate.
Running the tests is not as fast as running normal unit tests as there is an round-trip time to the database, but it's still fast enough to run whenever you want to.
-==- Testing the control flow -==-
Primary parts of the control flow are
1. create an archive directory if one does not exist
2. loop through all txt-files in input directory
3. parse DateTime from file name and if not possible, break control flow execution
4. execute dataflow for each file
5. move file to archive
Testing the control flow is a bit harder then the dataflow. Data flow is not currently injectable (which is a problem I need to solve), so I had to rework a bit of the control flow.
So, instead of newing up the dataflow inside the control flow I made a virtual function that we can override for mocking:
public virtual Dataflow CreateComicLoadDataflow(string file, DateTime parsedDateTime, string connectionString)
How to do this correctly is a problem for a later date. For now this should suffice for this simple exercise.
For the next part, I'm using OS mocking framework
Moq.
Settings up and cleaning up the tests:
private Mock<Dataflow> df = null;
private Mock<ComicLoaderControlFlow> cf = null;
private string tmpInputpath = null;
ETLParameters parameters = null;
[TestInitialize]
public void Initialize()
{
// create a mockup for dataflow
df = new Mock<Dataflow>(MockBehavior.Strict);
// setup dataflow execution to always to an empty task
df.Setup(d => d.Execute()).Returns(() => {
var task = new System.Threading.Tasks.Task(() => { Thread.Sleep(100); });
task.Start();
return task;
});
// create a partially mocked controlflow
cf = new Mock<ComicLoaderControlFlow>(MockBehavior.Strict);
// setup dataflow creation to return previously mocked dataflow
cf.Setup(c => c.CreateComicLoadDataflow(It.IsAny<string>(), It.IsAny<DateTime>(), It.IsAny<string>())).Returns(
df.Object
);
// setup input path
tmpInputpath = Path.Combine(Path.GetTempPath() + "TestDir");
Directory.CreateDirectory(tmpInputpath);
parameters = new ETLParameters();
}
[TestCleanup]
public void Cleanup()
{
// delete the whole input path directory recursively.
if (Directory.Exists(tmpInputpath))
Directory.Delete(tmpInputpath, true);
}
Let's verify step 1.
[TestMethod]
public void control_flow_creates_archive_directory()
{
var parameters = new ETLParameters();
string archiveDir = Path.Combine(tmpInputpath, "Archive");
parameters.Set("DirectoryComics", tmpInputpath);
parameters.Set("ConnectionString", "not empty");
// execute the dataflow
var result = cf.Object.Execute(parameters);
// verify that archive directory was created
Assert.IsTrue(Directory.Exists(archiveDir), "Control flow should have created an archive directory.");
Assert.AreEqual(ControlFlowResultType.Succeeded, result.Result);
}
Step 2 (looping (only) text files), step 3 part 1 (parse datetime succesfully) and step 4, execute dataflow for each file.
[TestMethod]
public void control_flow_processes_all_txt_files()
{
// create a file to be processed
File.CreateText(Path.Combine(tmpInputpath, "123010.txt")).Close();
File.CreateText(Path.Combine(tmpInputpath, "123110.txt")).Close();
File.CreateText(Path.Combine(tmpInputpath, "123110.csv")).Close();
parameters.Set("DirectoryComics", tmpInputpath);
parameters.Set("ConnectionString", "not empty");
var result = cf.Object.Execute(parameters);
string archiveDir = Path.Combine(tmpInputpath, "Archive");
// verify that only the two text files are processed
df.Verify(d => d.Execute(), Times.Exactly(2));
}
Yes, you should probably do this to 2-3 parts, because if one of the two cases fails, you won't know which one it was. Unfortunately, we don't have enough seems in the code, so this'll have to suffice for now.
Step 3 part 2.
[TestMethod]
public void control_flow_fails_if_filename_is_not_a_date()
{
// create a file to be processed
File.CreateText(Path.Combine(tmpInputpath, "fail.txt")).Close();
parameters.Set("DirectoryComics", tmpInputpath);
parameters.Set("ConnectionString", "not empty");
// execute the dataflow
var result = cf.Object.Execute(parameters);
Assert.AreEqual(ControlFlowResultType.Failed, result.Result, "Control flow should fail if filename is not a data.");
df.Verify(d => d.Execute(), Times.Never());
}
Step 5. moving file to archive.
[TestMethod]
public void control_flow_moves_processed_file_to_archive()
{
string archiveDir = Path.Combine(tmpInputpath, "Archive");
// create a file to be processed
File.CreateText(Path.Combine(tmpInputpath, "123110.txt")).Close();
parameters.Set("DirectoryComics", tmpInputpath);
parameters.Set("ConnectionString", "not empty");
// execute the dataflow
var result = cf.Object.Execute(parameters);
// verify that archive directory was created
Assert.IsTrue(File.Exists(Path.Combine(archiveDir, "123110.txt")), "File wasn't moved to archive.");
Assert.AreEqual(ControlFlowResultType.Succeeded, result.Result);
}
-==- Conclusion -==-
Dataflow testing is simple, control flow is still on the wrong level (and needs to be split up) for testability and needs to be refactored.