Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 25. August 2013 08:54

About a month ago, FSharp.Data  released version 1.1.9, which contains some very nice improvements – you can find them listed on Gustavo Guerra’s blog. I was particularly excited by the changes made to the CSV Type Provider, because they make my life digging through datasets even simpler, but couldn’t find the time to write about it, because of my recent cross-country peregrinations.

Now that I am back, let’s talk about why this update made me so happy, with a concrete example. My latest week-end project is an F# implementation of Random Forests; as part of the process, I am trying out the algorithm on various datasets, to get a sense for potential performance problems, and dog-food my own API, the best way I know to quickly spot suckiness.

One of the problems I ran into was the representation of missing values. Most datasets don’t come clean and ready to use – usually you’ll have a few records with missing data. I opted for what seemed the most straightforward representation in F#, and decided to represent every feature value as an Option – anything can either have Some value, or None.

The original CSV Type Provider introduced a bit of friction there, because it inferred types “optimistically”: if the sample used contained only integers, it would create an integer, which is great in most cases, except when you want to be “pessimistic” (which is usually a safe world-view when setting expectations regarding data).

The new-and-improved CSV Type Provider fixes that, and introduces a few niceties. Case in point: the Kaggle Titanic dataset, which contains the Titanic’s passenger list. With the new version, extracting the data is as simple as this:

type DataSet = CsvProvider<"titanic.csv", 
                           Schema="PassengerId=int, Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse", 

type Passenger = DataSet.Row

This is pretty awesome. In a couple of lines, just by passing in the path to my CSV file and some (optional) schema information, I get a Passenger type:


What’s neat here is that first, I immediately get a Passenger with properties – with the correct Optional types, thanks to SafeMode and PreferOptional. Then, notice in the Schema the Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse bit? This renames “on the fly” the properties; instead of the pretty obscurely named Parch feature coming from the CSV file header, I get a nice and readable ParentsOrChildren property. The Type Provider even does a few more cool things, automagically; for instance, the feature “Survived”, which is encoded in the original dataset as 0 or 1, gets automatically converted to a boolean. Really nice.

And just like that, I can now use this CSV file, and send it to my (still very much in alpha version) Decision Tree classifier:

// We read the training set into an array,
// defining the Label we want to classify on:
let training =
    use data = new DataSet()
    [| for passenger in data.Data -> 
        passenger.Survived |> Categorical, // the label
        passenger |]
// We define what features should be used:
let features = [|
    "Sex", (fun (x:Passenger) -> x.Sex |> Categorical);
    "Class", (fun x -> x.Class |> Categorical); |]
// We run the classifier...
let classifier, report = createID3Classifier training features { DefaultID3Config with DetailLevel = Verbose }
// ... and display the resulting tree:

… which produces the following results in the F# Interactive window:

> titanicDemo();;
├ Sex = male
│   ├ Class = 3 → False
│   ├ Class = 1 → False
│   └ Class = 2 → False
└ Sex = female
   ├ Class = 3 → False
   ├ Class = 1 → True
   └ Class = 2 → True
val it : unit = ()

The morale of the story here is triple. First, it was a much better idea to be a rich lady on the Titanic, rather than a (poor) dude. Then, Type Providers are really awesome – in a couple of lines, we extracted from a CSV file a collection of Passengers, all of them statically typed, with all the benefits attached to that; in a way, this is the best of both worlds – access the data as easily as with a dynamic language, but with all the benefits of types. Finally, the F# community is just awesome – big thanks to everyone who contributed to FSharp.Data, and specifically to @ovatsus for the recent improvements to the CSV Type Provider!

You can find the full Titanic example here on GitHub.

by Mathias 10. November 2012 11:00

The Kaggle/StackOverflow contest officially closed a few days ago, which makes it a perfect time to have a miniature retrospective on that experience. The objective of the contest was to write an algorithm to predict whether a StackOverflow question would be closed by moderators, and the reason why.

The contest was announced just a couple of days before what was supposed to be 4 weeks of computer-free vacation travelling around Europe. Needless to say, a quick change of plans followed; I am a big fan of StackOverflow, and Machine Learning has been on my mind quite a bit lately, so I packed my smallest laptop with Visual Studio installed. At the same time, the wonders of the Interwebs resulted in the formation of Team Charon - the awesome @lu_a_jalla and me, around the loosely defined project of "having fun with this, using 100% F#".

Now that the contest is over, here are a few notes on the experience, focusing on process and tools, and not the modeling aspects – I’ll get back to that in a later post.

This is my first unquestionably positive experience with a dispersed team - every morning I was genuinely looking forward to code check-ins, something I can't say of every experience I have had with remote teams. I recall reading somewhere that there was only one valid reason to work with a dispersed team: when you really want to work with that person, and it is the only way to work together. I tend to agree, and this was tremendously fun. There are not that many opportunities to have meaningful interactions involving both F# and Machine Learning, and I learnt quite a bit in the process, in large part because this was team work.

As a side note, I find it amazing how ridiculously easy it is today to set up a collaborative environment. Set up a GitHub repository, use Skype and Twitter – and you are good to go. The only thing technology hasn’t quite solved yet are these pesky time zones: Minsk and San Francisco are still 11 hours apart. This is were a team of night owls might help…

Whenever there is a deadline, make sure the when and what is clear. Had I followed this simple rule, I would have been on time for the final submission. Instead, I missed it by a couple of hours, because I didn't check what "you have three days left" meant exactly, which is too bad, because otherwise we could have ended up in 27th position, among 160+ competitors:


... which is a result I am pretty proud of, given that this was my first “official” attempt at Machine Learning stuff, and some of the competitors looked pretty qualified. During the initial phase, we went as high as 10th position, and ended up in 40th position, in the top 25%.



Comment RSS