Jan 27, 2010

C# 4.0 - Dynamic Linq to CSV

--==-- Dynamic Linq to CSV --==--

I wanted to try the new capabilities of dynamic keyword in .Net Framework 4.0 and this is what I came up with; dynamically querying CSV-files. CSV is simple two dimensional data format, so we can focus more on what you can do then on the format itself. So the main point is dynamic Linq queries and not CSV, therefore the implementation is not 100% complete. For example, multiline columns are not supported and this is a read-only implementation.

--==-- CSV- and class-structure --==--

You can find basic information on CSV-format here. Basically what we have are rows and cells incide rows. Cells are separated inside a row by a separator character; normally, as the name says, a comma. Cells are enclosed in quotes if value inside cell contains the separator character or a quote. Quotes as value inside cell is transformed to double-quote.

For example:
first column,second column,"special, character","He said ""Hello"""

I've used DynamicCSV as the main class that uses DynamicCSVEnumerator (GetRows) to loop DynamicCSVRow objects. When accessing properties of DynamicCSVRow, DynamicCSVCell objects are created.

--==-- Sample usage --==---

// Load CSV-file using comma as column separator and use header row to name columns.
dynamic csv = new DynamicCSV(/* filename: */ @"c:\names.csv", 
                                                    /* separator: */ ',', 
                                                    /* has header: */ true);

// Query from CSV where column name is "Firstname" and value in cell is "Onni".
var query = (from dynamic c in csv.Rows() where c.FirstName == "Onni" select c);

// Loop all found rows and print FirstName and LastName values from the row.
foreach (var row in query) {
  Console.WriteLine("{0} {1}", row.FirstName, row.LastName);
}

The key to using DynamicObject in Linq query is using the dynamic keyword. Without it the query fails at compile time as it can't find FirstName property. Also, using DynamicObject in place of a data source is not possible. For example:
// This will cause an compile time exception.
dynamic rows = csv.Rows();
var query = (from dynamic c in rows where c.FirstName == "Onni" select c);

Now, access to "FirstName" property is gotten through TryGetMember in DynamicCSVRow class. The property name "FirstName" is itself gotten from the header row of the CSV-file and stored. There are caveats of course. Header column can contain characters that are not allow as a property name, so we have to filter them out or replace them. For example, we are replacing all whitespace with an underscore. You can see DynamicCSVColumnNameFormatter for details.

Let's move onto how exactly we are getting the values.

// We are storing the Row data in string array and metadata for it in Tuple Metadata property.
// Metadata contains the property's name and it's index in string array.
public override bool TryGetMember(GetMemberBinder binder, out object result) {

  // find the row data based on column name
  var member = Metadata.Where(m => m.Item2 == binder.Name).FirstOrDefault();

  // if found, return DynamicCSVCell
  if (member != null) {
    result = new DynamicCSVCell(member.Item2, ColumnData[member.Item1]);
    return true;
  } else {
    // nothing found
    result = null;
    return false;
  }
}

(See the full code for more details)

Similarly, if your CSV doesn't have a header, you can use index property.
var query = (from dynamic c in csv.Rows() where c[0] == "Onni" select c);
Index property is gotten through TryGetIndex method.

public override bool TryGetIndex(GetIndexBinder binder, object[] indexes, out object result) {
  // get index number. 
  int index = (int)indexes[0]; // we are assuming that the index is an int.
  // if there are enough columns, return new DynamicCSVCell
  if (index + 1 <= ColumnData.Length) {
    result = new DynamicCSVCell(Metadata[index].Item2, ColumnData[index]);
    return true;
  } else {
    // out of index range.
    throw new IndexOutOfRangeException();
  }
}

--==-- Custom typecasting --==--
Now that we have access to the data in form of DynamicCSVCell, we still have to convert it to a form that can be used with other data types. If you haven't read my previous post, check it out.
 
DynamicCSVCell implements implicit and explicit casts to multiple formats that we can use. It's user's responsibility to know what data is inside cell. This is dynamic after all.
 

// Implicit conversion to int. 
public static implicit operator int(DynamicCSVCell a) { 
  return int.Parse(a.Value);
}

You can also create more elaborate conversions as well. Here's a sample conversion function for DateTime:

public override bool TryInvokeMember(InvokeMemberBinder binder, object[] args, out object result) {

  switch (binder.Name.ToUpper())
  {
    case "ASDATETIME":
    // try parse with default format
    if (args == null) {
      result = DateTime.Parse(Value);
      return true;
    }
    // user provided formatprovider
    if (args[1] is IFormatProvider) {
      result = DateTime.Parse(Value, args[0] as IFormatProvider);
      return true;
    }
    // user provided format string
    if (args[1] is string) {
      result = DateTime.ParseExact(Value, args[0] as string, CultureInfo.InvariantCulture);
      return true;
    }
throw new ArgumentException("Either enter no arguments or provide IFormatProvider or string format for DateTime.");

  } 
  result = null;
  return false;
}

With this function you can do DateTime casting with a custom format:
var q = (from dynamic c in namesCSV select c.BirthDate.AsDateTime("d.M.yyyy"));

--==-- Linq functions --==--

Almost all of the Linq queries seem to work fine with dynamic objects.

Group by function:

// count of instances with same lastnames
var query = (from dynamic c in namesCSV.Rows() 
                   group c by c.LastName into grp
                   select new { LastName = grp.Key, Count = grp.Count() }
                  );

// sum row numbers per lastname.
var query = (from dynamic c in namesCSV.Rows() 
                   group c by c.LastName into grp
                   select new { LastName = grp.Key, Sum = grp.Sum(s => s.RowNumber) }
                  );

Unfortunately, Sum without group by is one of the special cases.

var query = (from dynamic c in namesCSV.Rows() select c.RowNumber);
query.Sum(); // won't work

"cannot convert from 'System.Collections.Generic.IEnumerable' to 'System.Linq.IQueryable'"
 
But instead:
// will work when you have a implicit cast to a type that Enumerable supports. query.Sum(s => s); 

See MSDN for details.

Min/Max are their own special cases.
 
var query = (from dynamic c in csv.Rows() select c.RowNumber); 
  
int min = query.Min(); // will use IComparable interface; if it's not implemented, throws an exception.  
  

But, implementing the IComparable interface is tricky, when you don't exactly know what the cell contains. For example, comparing strings is very different from comparing integers.
 
As a workaround, you can explicitly cast the value to int before summing them up.
 
int min = query.Min(s => (int)s); 
 
This will work without using IComparable on DynamicCSVCell class; instead it uses implicit (if implemented) or explicit cast of DynamicCSVCell to int. Either
public static implicit operator int(DynamicCSVCell dynamicCSVCell)
or
public override bool TryConvert(ConvertBinder binder, out object result)
 
Most of the rest of the cases work by the same principles as these. I haven't actually tried them all as there are quite a few of them, so all feedback is always welcome.
Basic principle is that anything that uses the individual field you have to be careful with. Conversion is not always obvious. Anything that uses the IEnumerable Linq extension methods like First, Last, Take etc will work automatically.

--==-- Joining queries --==--

Joining between CSV files or CSV files and other sources will work, but you have to remember that the default output of property is DynamicCSVCell and not he value, so comparing a property to a string are not equal. You should use explicit casting when comparing dynamic and non-dynamic values in join clause.

from dynamic c in namesCSV.Rows() join source on (int)c.RowNumber equals s

Also note that dynamic source must be the left side of the query:

from s in source join dynamic c in namesCSV.Rows()
on s equals c.RowNumber

..throws an compile time error: "An expression tree may not contain a dynamic operation".

--==-- Data binding in ASP.Net and Silverlight --==--

Unfortunately, data binding on DynamicObject does not work in ASP.Net or Silverlight, but According to Microsoft Connect it should work on WPF. This includes System.Dynamic.ExpandoObject that comes with the framework.


Though using dynamic objects in ASP.Net MVC is quite straightforward.

--==-- Source code and sample data --==--

Source code can be found here. The sample data can be found here.

names.csv: the list is fictional names taken from most popular Finnish first- and surnames. Any resemblance to persons living or dead is purely coincidental.

--==-- Feedback --==--


Hopefully someone will find this useful. Any feedback is appreciated.

1 comments:

  1. Toni I get an error:"Query expressions over source type 'dynamic' or with a join sequence of type 'dynamic' are not allowed" on the where clause for the linq query. Any ideas on how to fix, or what I maybe doing wrong?

    ReplyDelete