Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 25. August 2013 08:54

About a month ago, FSharp.Data  released version 1.1.9, which contains some very nice improvements – you can find them listed on Gustavo Guerra’s blog. I was particularly excited by the changes made to the CSV Type Provider, because they make my life digging through datasets even simpler, but couldn’t find the time to write about it, because of my recent cross-country peregrinations.

Now that I am back, let’s talk about why this update made me so happy, with a concrete example. My latest week-end project is an F# implementation of Random Forests; as part of the process, I am trying out the algorithm on various datasets, to get a sense for potential performance problems, and dog-food my own API, the best way I know to quickly spot suckiness.

One of the problems I ran into was the representation of missing values. Most datasets don’t come clean and ready to use – usually you’ll have a few records with missing data. I opted for what seemed the most straightforward representation in F#, and decided to represent every feature value as an Option – anything can either have Some value, or None.

The original CSV Type Provider introduced a bit of friction there, because it inferred types “optimistically”: if the sample used contained only integers, it would create an integer, which is great in most cases, except when you want to be “pessimistic” (which is usually a safe world-view when setting expectations regarding data).

The new-and-improved CSV Type Provider fixes that, and introduces a few niceties. Case in point: the Kaggle Titanic dataset, which contains the Titanic’s passenger list. With the new version, extracting the data is as simple as this:

type DataSet = CsvProvider<"titanic.csv", 
                           Schema="PassengerId=int, Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse", 
                           SafeMode=true, 
                           PreferOptionals=true>

type Passenger = DataSet.Row

This is pretty awesome. In a couple of lines, just by passing in the path to my CSV file and some (optional) schema information, I get a Passenger type:

Titanic

What’s neat here is that first, I immediately get a Passenger with properties – with the correct Optional types, thanks to SafeMode and PreferOptional. Then, notice in the Schema the Pclass->Class, Parch->ParentsOrChildren, SibSp->SiblingsOrSpouse bit? This renames “on the fly” the properties; instead of the pretty obscurely named Parch feature coming from the CSV file header, I get a nice and readable ParentsOrChildren property. The Type Provider even does a few more cool things, automagically; for instance, the feature “Survived”, which is encoded in the original dataset as 0 or 1, gets automatically converted to a boolean. Really nice.

And just like that, I can now use this CSV file, and send it to my (still very much in alpha version) Decision Tree classifier:

// We read the training set into an array,
// defining the Label we want to classify on:
let training =
    use data = new DataSet()
    [| for passenger in data.Data -> 
        passenger.Survived |> Categorical, // the label
        passenger |]
// We define what features should be used:
let features = [|
    "Sex", (fun (x:Passenger) -> x.Sex |> Categorical);
    "Class", (fun x -> x.Class |> Categorical); |]
// We run the classifier...
let classifier, report = createID3Classifier training features { DefaultID3Config with DetailLevel = Verbose }
// ... and display the resulting tree:
report.Value.Pretty()

… which produces the following results in the F# Interactive window:

> titanicDemo();;
├ Sex = male
│   ├ Class = 3 → False
│   ├ Class = 1 → False
│   └ Class = 2 → False
└ Sex = female
   ├ Class = 3 → False
   ├ Class = 1 → True
   └ Class = 2 → True
val it : unit = ()
>

The morale of the story here is triple. First, it was a much better idea to be a rich lady on the Titanic, rather than a (poor) dude. Then, Type Providers are really awesome – in a couple of lines, we extracted from a CSV file a collection of Passengers, all of them statically typed, with all the benefits attached to that; in a way, this is the best of both worlds – access the data as easily as with a dynamic language, but with all the benefits of types. Finally, the F# community is just awesome – big thanks to everyone who contributed to FSharp.Data, and specifically to @ovatsus for the recent improvements to the CSV Type Provider!

You can find the full Titanic example here on GitHub.

by Mathias 5. July 2013 15:51

Besides having one of the coolest names around, Random Forest is an interesting machine learning algorithm, for a few reasons. It is applicable to a large range of classification problems, isn’t prone to over-fitting, can produce good quality metrics as a side-effect of the training process itself, and is very suitable for parallelization. For all these reasons, I thought it would be interesting to try it out in F#.

The current implementation I will be discussing below works, but isn’t production ready (yet) – it is work in progress. The API and implementation are very likely to change over the next few weeks. Still, I thought I would share what I did so far, and maybe get some feedback!

The idea behind the algorithm

As the name suggests, Random Forest (introduced in the early 2000s by Leo Breiman) can be viewed as an extension of Decision Trees, which I discussed before. A decision tree grows a single classifier, in a top-down manner: the algorithm recursively selects the feature which is the most informative, partitions the data according to the outcomes of that feature, and repeats the process until no information can be gained by partitioning further. On a non-technical level, the algorithm is playing a smart “game of 20 questions”: given what has been deduced so far, it picks from the available features the one that is most likely to lead to a more certain answer.

How is a Random Forest different from a Decision Tree? The first difference is that instead of growing a single decision tree, the algorithm will create a “forest” – a collection of Decision Trees; the final decision of the classifier will be the majority decision of all trees in the forest. However, having multiple times the same tree wouldn’t be of much help, because we would get the same classifier repeated over and over again. This is where the algorithm gets interesting: instead of growing a Tree using the entire training set and features, it introduces two sources of randomness:

  • each tree is grown on a new sample, created by randomly sampling the original dataset with replacement (“bagging”),
  • at each node of the tree, only a random subset of the remaining features is used.

Why would introducing randomness be a good idea? It has a few interesting benefits:

  • by selecting different samples, it mitigates the risk of over-fitting. A single tree will produce an excellent fit on the particular dataset that was used to train it, but this doesn’t guarantee that the result will generalize to other sets. Training multiple trees on random samples creates a more robust overall classifier, which will by construction handle a “wider” range of situations than a single dataset,
  • by selecting a random subset of features, it mitigates the risks of greedily picking locally optimal features that could be overall sub-optimal. As a bonus, it also allows a computation speed-up for each tree, because fewer features need to be considered at each step,
  • the bagging process, by construction, creates for each tree a Training Set (the selected examples) and a Cross-Validation Set (what’s “out-of-the-bag”), which can be directly used to produce quality metrics on how the classifier may perform in general.

Usage

Before delving into the current implementation, I thought it would be interesting to illustrate on an example the intended usage. I will be using the Titanic dataset, from the Kaggle Titanic contest. The goal of the exercise is simple: given the passengers list of the Titanic, and what happened to them, can you build a model to predict who sinks or swims?

I didn’t think the state of affairs warranted a Nuget package just yet, so this example is implemented as a script, in the Titanic branch of the project itself on GitHub.

First, let’s create a Record type to represent passengers:

type Passenger = {
    Id: string; 
    Class: string;
    Name: string;
    Sex: string;
    Age: string;
    SiblingsOrSpouse: string;
    ParentsOrChildren: string;
    Ticket: string;
    Fare: string;
    Cabin: string;
    Embarked: string }

Note that all the properties are represented as strings; it might be better to represent them for what they are (Age is a float, SiblingsOrSpouse an integer…) – but given that the dataset contains missing data, this would require dealing with that issue, perhaps using an Option type. We’ll dodge the problem for now, and opt for a stringly-typed representation.

Next, we need to construct a training set from the Kaggle data file. We’ll use the CSV parser that comes with FSharp.Data to extract the passengers from that list, as well as their known fate (the file is assumed to have been downloaded on your local machine first):

let path = @"C:\Users\Mathias\Documents\GitHub\Charon\Charon\Charon\train.csv"
let data = CsvFile.Load(path).Cache()

let trainingSet =
    [| for line in data.Data -> 
        line.GetColumn "Survived" |> Some, // the label
        {   Id = line.GetColumn "PassengerId"; 
            Class = line.GetColumn "Pclass";
            Name = line.GetColumn "Name";
            Sex = line.GetColumn "Sex";
            Age = line.GetColumn "Age";
            SiblingsOrSpouse = line.GetColumn "SibSp";
            ParentsOrChildren = line.GetColumn "Parch";
            Ticket = line.GetColumn "Ticket";
            Fare =line.GetColumn "Fare";
            Cabin = line.GetColumn "Cabin";
            Embarked = line.GetColumn "Embarked" } |]

Now that we have data, we can get to work, and define a model. We’ll start first with a regular Decision Tree, and extract only one feature, Sex:

let features = 
    [| (fun x -> x.Sex |> StringCategory); |]

What this is doing is defining an Array of features, a feature being a function which takes in a Passenger, and returns an Option string, via the utility StringCategory. StringCategory simply expects a string, and transforms a null or empty case into the “missing data” case, and otherwise treats the string as a Category. So in that case, x is a passenger, and if no Sex information is found, it will transform it into None, and otherwise into Some(“male”) or Some(“female”), the two cases that exist in the dataset.

We are now ready to go – we can run the algorithm and get a Decision Tree classifier, with a minimum leaf of 5 elements (i.e. we stop partitioning if we have less than 5 elements left):

let minLeaf = 5
let classifier = createID3Classifier trainingSet features minLeaf

… and we are done. How good is our classifier? Let’s check:

let correct = 
    trainingSet
    |> Array.averageBy (fun (label, obs) -> 
        if label = Some(classifier obs) then 1. else 0.)
printfn "Correct: %.4f" correct

More...

Comments

Comment RSS