What you need is an existing third-party IFilter implementation, and there are a few free ones. A couple I would like to mention are:
- Foxit PDF IFilter (the desktop edition is free of charge; all you have to do is register)
- Adobe Acrobat IFilter -- Just install Adobe Acrobat Reader 9
- Citeknet CHM IFilter
I'm using Visual Studio 2010 RC and C# 4.0, which you can download here.
-==- "Map/Reduce" -==-
As I didn't want to parse all the files each time I searched and caching the text output of IFilter would add a significant amount of space requirements to my app, I decided to do a simple wordcount statistic that would be the basis of my search. (Note. I wouldn't suggest using such a simplistic search in production.)
This sample isn't distributed the way Map/Reduce should be, but by replacing the local parser with, for example, a remote WCF parser method, you can distribute the work to as many computers as you need. Be aware that transferring the original file, or the output of the "format to text" conversion (depending on which side you want to be responsible for converting the original file to text), to a remote machine needs enough upload bandwidth to make it faster than local execution.
Downloading the response shouldn't be a strain on the network as it should be only a small percentage of the original output length.
Prototype of reducing the parsed words:
from word in Text
where word.Length > 2 && word.All(char.IsLetter) // letters only
group word by word into grp
select new { Word = grp.Key, Count = grp.Count() }
--> where Text is the original text split into words on whitespace.
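To make the prototype concrete, here is a minimal sketch of it as a compilable method. The class and method names (WordCountPrototype, CountWords) are my own placeholders and not part of the attached sample, and I'm assuming "has only characters" means "letters only".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the attached sample.
public static class WordCountPrototype
{
    // Splits raw text on whitespace and counts every word that is longer
    // than two characters and consists of letters only.
    public static Dictionary<string, int> CountWords(string text)
    {
        // A null separator array makes Split break on any whitespace.
        IEnumerable<string> words =
            text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        var query = from word in words
                    where word.Length > 2 && word.All(char.IsLetter)
                    group word by word into grp
                    select new { Word = grp.Key, Count = grp.Count() };

        return query.ToDictionary(x => x.Word, x => x.Count);
    }
}
```

Calling CountWords("the cat and the hat, the end") counts "the" three times and drops "hat," because of the trailing comma, which is one reason this is too simplistic for production use.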
-==- IFilter parser -==-
I borrowed Jason Zander's example of parsing PDF output with IFilter and put it to work here.
Note. Jason's code is licensed under MICROSOFT PUBLIC LICENSE (Ms-PL) so before thinking of reusing my sample, please read the license.
I modified the code to return an IEnumerable<string> instead of the whole buffer, so I could simply run a LINQ query against it (see "prototype" above).
-==- How the sample works -==-
1. We loop through the given directory and find all supported files (pdf, chm).
2. We use Jason's FilterCode and group all the words into a word count.
[here we could save the results to a database, but for sample purposes we just keep them in memory for now]
3. We look for the given keywords in the reduced sets and list the matches in descending order by word count as search results.
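Steps 1 and 2 can be sketched roughly like this. The names (IndexerSketch, FindSupportedFiles, BuildIndex) are hypothetical, and the parse delegate is only a stand-in for the IFilter-based parser, so the sketch does not depend on any installed filter. Passing the parser in as a delegate is also what would let you swap the local parser for a remote one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical sketch of steps 1 and 2; not the code from the attached sample.
public static class IndexerSketch
{
    // Step 1: find every file under root whose extension is in the supported list.
    public static IEnumerable<string> FindSupportedFiles(
        string root, IEnumerable<string> extensions)
    {
        // Normalize "pdf" and ".pdf" to ".pdf" and compare case-insensitively.
        var wanted = new HashSet<string>(
            extensions.Select(e => "." + e.TrimStart('.')),
            StringComparer.OrdinalIgnoreCase);

        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                        .Where(path => wanted.Contains(Path.GetExtension(path)));
    }

    // Step 2: parse each file into words and reduce them to per-file word counts.
    // The parse delegate stands in for the IFilter parser (assumption).
    public static Dictionary<string, Dictionary<string, int>> BuildIndex(
        IEnumerable<string> files, Func<string, IEnumerable<string>> parse)
    {
        return files.ToDictionary(
            file => file,
            file => parse(file)
                .Where(word => word.Length > 2)
                .GroupBy(word => word.ToUpper())
                .ToDictionary(grp => grp.Key, grp => grp.Count()));
    }
}
```

The result maps each filename to its own word counts and stays in memory, which matches the "keep them in memory for now" note in step 2.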
The attached console app uses Mono.Options to parse command line arguments.
Here's a sample usage:

-==- Code -==-
And now, finally, some code :)
I'm not going to present the filter itself, as it's a long and complex part; you can check it out at MSDN Code or in my attached source code. Here are a few interesting parts.
Grouping a set of words to a list of word counts:
public IEnumerable<WordCountItem> Reduce(IEnumerable<string> words) {
    return from word in words
           where word.Length > 2
           group word by word.ToUpper() into grp
           select new WordCountItem(grp.Key, grp.Count());
}
I'm filtering out words that are too short (2 characters or fewer) and returning a list of unique words with the number of times each appears in the input. Words are grouped by their upper-cased form, so the count is case-insensitive.
The actual search needed a more complex query, but that was probably because of my overly complex storage type for the file and word count list ("string, List<WordCountItem>").
public IEnumerable<WordCountResult> Find(List<string> keywords) {
    // Filter out everything but the found keywords and their word counts.
    // Stored words are lower-cased before comparison, so keywords must be lower-case.
    var keywordsQuery = from words in WordCountRepository.Query()
                        from word in words.Value
                        where keywords.Contains(word.Word.ToLower())
                        select new { Filename = words.Key, Keyword = word.Word, Count = word.Count };

    // Group the results and calculate the average count per keyword,
    // returning only one instance of each filename and its average found word count.
    // Return the results with the highest average count first.
    return from g in keywordsQuery
           // We use Count as the field for group operations (sum, average, count) and Filename as the "key" value.
           group g.Count by g.Filename into grp
           orderby grp.Average() descending
           select new WordCountResult() { Filename = grp.Key, AvgCount = grp.Average() };
}
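To see how the two queries fit together, here is a small self-contained sketch. WordCountItem and WordCountResult are my guesses at the shape of the real types (only their usage is shown above), and the repository is passed in as a plain dictionary instead of going through WordCountRepository.Query().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the types used in the post.
public class WordCountItem
{
    public WordCountItem(string word, int count) { Word = word; Count = count; }
    public string Word { get; private set; }
    public int Count { get; private set; }
}

public class WordCountResult
{
    public string Filename { get; set; }
    public double AvgCount { get; set; }
}

public static class SearchSketch
{
    // Same query as Reduce: words are stored upper-cased.
    public static List<WordCountItem> Reduce(IEnumerable<string> words)
    {
        return (from word in words
                where word.Length > 2
                group word by word.ToUpper() into grp
                select new WordCountItem(grp.Key, grp.Count())).ToList();
    }

    // Same query as Find, but over a dictionary supplied by the caller
    // (assumption: filename -> word counts). Keywords must be lower-case.
    public static List<WordCountResult> Find(
        IDictionary<string, List<WordCountItem>> repository, List<string> keywords)
    {
        var hits = from file in repository
                   from item in file.Value
                   where keywords.Contains(item.Word.ToLower())
                   select new { Filename = file.Key, item.Count };

        return (from hit in hits
                group hit.Count by hit.Filename into grp
                orderby grp.Average() descending
                select new WordCountResult
                {
                    Filename = grp.Key,
                    AvgCount = grp.Average()
                }).ToList();
    }
}
```

If a.pdf contains "linq linq query filter" and b.pdf contains "linq filter filter filter filter", searching for "linq" and "filter" ranks b.pdf first with an average of 2.5, ahead of a.pdf with 1.5. Note that the average is taken only over the keywords actually found in a file, so a file matching one keyword many times can outrank a file matching all of them.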
That's it. Quite simple. One other thing I would like to endorse is Mono.Options. It made console argument handling very easy. Here's what I needed:
var options = new OptionSet() {
    { "d|directory=", "Root {DIRECTORY} that contains books", v => directory = v },
    { "e|extensions=", "Supported file {EXTENSIONS}. Separate with commas. Default is chm and pdf.", v => supportedFiletypes.AddRange(v.Split(',')) },
    { "k|keywords=", "{KEYWORDS} to find. Separate with commas.", v => keywords.AddRange(v.Split(',')) },
    { "t|top=", "Show top {COUNT} results. Default is 10.", v => top = int.Parse(v) },
    { "test=", "Test if {file} is supported by IFilter", v => { testIFilterSupport = v != null; testFilename = v; } },
    { "h|help", "Show this help", v => showHelp = v != null }
};
options.Parse(args);
The final argument in an option is a lambda action (Action<string>) that operates on the given argument. As you can see, I've used it for multiple purposes here. For -h (help), which takes no argument, I simply use it to set a boolean showHelp flag. For comma-separated values like keywords and supported file types, I use it to split the value and add the parts to a list declared outside the OptionSet.
Simple and effective. And if you need to print out all possible options, you can just say:
options.WriteOptionDescriptions(System.Console.Out);
-==- Other types -==-
IFilter supports a multitude of types and I would recommend you check out the IFilter explorer, if you haven't already, so you can see all the filters you have already installed on your machine.
For example, you can use it to search through your .cs files:

-==- Source -==-
..can be found here. I also included the two PDFs shown above in the console project for testing purposes. And remember, when running the console app in Visual Studio, to set the command line arguments on the console project's Debug tab.