Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

## The idea behind the Algorithm

The canonical application of Bayes naïve classification is in text classification, where the goal is to identify to which pre-determined category a piece of text belongs to  – for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain these words. They have a much higher frequency within the category “Spam” than within the Ham, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam”|contains ”Nigeria”) = P(contains “Nigeria|is ”Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?

The algorithm is named “Naïve”, because it makes a simplifying assumption about the text, which turns out to be very convenient for computations purposes, namely that each word appears with a frequency which doesn’t depend on other words. This is an unlikely assumption (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).

We’ll leave it at that on the concepts –  I’ll refer the reader who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

## A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found found on GitHub

let dataset =
[| ("Ham",  "My dog has flea problems help please");
("Spam", "Maybe not take him to dog park stupid");
("Ham",  "My dalmatian is so cute I love him");
("Spam", "Stop posting stupid worthless garbage");
("Ham",  "Mr Licks ate my steak how to stop him");
("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
open System.Text.RegularExpressions

// Regular Expression matching full words, case insensitive.
let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

// Extract and count words from a string.
// http://stackoverflow.com/a/2159085/114519
let wordsCount text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.groupBy (fun m -> m.Value)
|> Seq.map (fun (value, groups) ->
value.ToLower(), (groups |> Seq.length))

// Extracts all words used in a string.
let vocabulary text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.map (fun m -> m.Value.ToLower())
|> Seq.distinct

// Extracts all words used in a dataset;
// a Dataset is a sequence of "samples",
// each sample has a label (the class), and text.
let extractWords dataset =
dataset
|> Seq.map (fun sample -> vocabulary (snd sample))
|> Seq.concat
|> Seq.distinct

// "Tokenize" the dataset: break each text sample
// into words and how many times they are used.
let prepare dataset =
dataset
|> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.

More...

5. August 2012 13:50

Today’s topic will be Chapter 3 of “Machine Learning in Action”, which covers Decision Trees.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

The idea behind Decision Trees is similar to the Game of 20 Questions: construct a set of discrete Choices to identify the Class of an item. We will use the following dataset for illustration: imagine that we have 5 cards, each with a major masterpiece of contemporary cinema, classified by genre. Now I hide one – and you can ask 2 questions about the genre of the movie to identify the Thespian luminary in the lead role, in as few questions as possible:

 Action Sci-Fi Actor Cliffhanger Yes No Stallone Rocky Yes No Stallone Twins No No Schwarzenegger Terminator Yes Yes Schwarzenegger Total Recall Yes Yes Schwarzenegger

The questions you would likely ask are:

• Is this a Sci-Fi movie? If yes, Arnold is the answer, if no,
• Is this an Action movie? if yes, go for Sylvester, otherwise Arnold it is.

That’s a Decision Tree in a nutshell: we traverse a Tree, asking about features, and depending on the answer, we draw a conclusion or recurse deeper into more questions. The goal today is to let the computer build the right tree from the dataset, and use that tree to classify “subjects”.

## Defining a Tree

Let’s start with the end – the Tree. A common and convenient way to model Trees in F# is to use a discriminated union like this:

type Tree =
| Conclusion of string
| Choice of string * (string * Tree) []

A Tree is composed of either a Conclusion, described by a string, or a Choice, which is described by a string, and an Array of multiple options, each described by a string and its own Tree, “tupled”.

For instance, we can manually create a tree for our example like this:

let manualTree =
Choice
("Sci-Fi",
[|("No",
Choice
("Action",
[|("Yes", Conclusion "Stallone");
("No", Conclusion "Schwarzenegger")|]));
("Yes", Conclusion "Schwarzenegger")|])

Our tree starts with a Choice, labeled “Sci-Fi”, with 2 options in an Array, “No” or “Yes”. “Yes” leads to a Conclusion (a Leaf node), Arnold, while “No” opens another Choice, “Action”, with 2 Conclusions.

So how can we use this to Classify a “Subject”? We need to traverse down the Tree, check what branch corresponds to the Subject for the current Choice, and continue until we reach a Decision node, at what point we can return the contents of the Conclusion. To that effect, we’ll represent a “Subject” (the thing we are trying to classify) as an collection of Tuples, each Tuple being a key/value pair, representing a Feature and its value:

let test = [| ("Action", "Yes"); ("Sci-Fi", "Yes") |]

We are ready to write a classification function now:

let rec classify subject tree =
match tree with
| Conclusion(c) -> c
| Choice(label, options) ->
let subjectState =
subject
|> Seq.find(fun (key, value) -> key = label)
|> snd
options
|> Array.find (fun (option, tree) -> option = subjectState)
|> snd
|> classify subject

classify is a recursive function: given a subject and a tree, if the Tree is a Conclusion, we are done, otherwise, we retrieve the label of the next Choice, find the value of the Subject for that Choice, and use it to pick the next level of the Tree.

At that point, using the Tree to classify our subject is as simple as:

> let actor = classify test manualTree;;

val actor : string = "Schwarzenegger"

Not bad for 14 lines of code. The most painful part is the manual construction of the Tree – let’s see if we can get the computer to build that for us.

More...

1. August 2012 17:27

This is the continuation of this post, where I began exploring k-nearest-neighbor classification, a Machine Learning algorithm used to classify items in different categories, based on an existing sample of items that have been properly classified.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading “Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

To determine what category an item belongs to, the algorithm measures its distance from each element of the known dataset, and takes a “majority vote” to determine its category, based on its k nearest neighbors.

In our last installment, we wrote a simple classification function, which was doing just that. Today, we’ll continue along Chapter 2, applying our algorithm to real data, and dealing with data normalization.

## The dataset: Elections 2008

Rather than use the data sample from the book, I figured it would be more fun to create my own data set. The problem we’ll look into is whether we can predict whether a state voted Republican or Democrat in the 2008 presidential election. Our dataset will consist of the following: the Latitude and Longitude of the State, and its Population. A State is classified as Democrat if the number of votes (i.e. popular vote) recorded for Obama was greater than McCain.

Notes

• My initial goal was to do this for Cities, but I couldn’t find voting data at that level – so I had to settle for States. The resulting sample is smaller than I would have liked, and also less useful (we can’t realistically use our classifier to produce a prediction for a new State, because the likelihood of a new State joining the Union is fairly small) but it still works well for illustration purposes.
• Computing distance based on raw Latitude and Longitude wouldn’t be such a great idea in general, because they denote a position on a sphere (very close points may have very different Longitudes); however, given the actual layout of the United States, this will be good enough here.

I gathered the data (see sources at the end) and saved it in a comma-delimited text file “Election2008.txt” (the raw text version is available at the bottom of the post).

First, we need to open that file and parse it into the structure we used in our previous post, with a matrix of observations for each state, and a vector of categories (DEM or REP). That’s easy enough with F#:

let elections =
let file = @"C:\Users\Mathias\Desktop\Elections2008.txt"
let fileAsLines =
|> Array.map (fun line -> line.Split(','))
let dataset =
fileAsLines
|> Array.map (fun line ->
[| Convert.ToDouble(line.[1]);
Convert.ToDouble(line.[2]);
Convert.ToDouble(line.[3]) |])
let labels = fileAsLines |> Array.map (fun line -> line.[4])
dataset, labels

We open System and System.IO, and use File.ReadAllLines, which returns an array of strings for each line in the text file, and apply string.Split to each line, using the comma as a split-delimiter, which gives us an array of string for each text line. For each row, we then retrieve the dataset part, elements 1, 2 and 3 (the Latitude, Longitude and Population), creating an Array of doubles by converting the strings to doubles. Finally, we retrieve the fourth element of each row, the vote, into an array of labels, and return dataset and labels as a tuple.

Let’s visualize the result, using the FSharpChart display function we wrote last time. display dataset 1 0 plots Longitude on the X axis, and Latitude on the Y axis, producing a somewhat unusual map of the United States, where we can see the red states / blue states pattern (colored differently), with center states leaning Republican, and Coastal states Democrat:

Plotting Longitude against Population produces the following chart, which also displays some clusters:

Finally, Longitude versus Population also exhibits some patterns:

## Normalizing the data

We could run the algorithm we wrote in the previous post on the data, and measure the distances between observations based on the raw measures, but this would likely produce poor results, because of the discrepancy in scales. Our Latitudes range from about 20 (Hawaii) to 60 (Alaska), while Populations vary from half a million (Washington DC) to over 30 millions (California). Because we compute the distance between observations using the Euclidean distance, differences in Population will have a huge weight in the overall distance, and in effect we would be “ignoring” Latitude and Longitude in our classification.

Consider this: the raw distance between 2 states is given by

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + (Lat(S1) – Lat(S2))^2 + (Lon(S1) – Lon(S2))^2)

Even if we considered 2 states located as far as possible from each other, at the 2 corners of our map, the distance would look like:

Distance (S1, S2) = sqrt ((Pop(S1) – Pop(S2))^2 + 40^2 + 90^2)

It’s clear that even minimal differences in population in this formula (say, 100,000 inhabitants) will completely dwarf the largest possible effect of geographic distance. If we want to observe the effect of all three dimensions, we need to convert the measurements to a comparable scale, so that differences in Population are comparable, relatively, to differences in Location.

To that effect, we will Normalize our dataset. We’ll follow the example of “Machine Learning in Action”, and normalize each measurement so that its minimal value is 0, and its maximum is 1. Other approaches would be feasible, but this one has the benefit of being fairly straightforward. If we consider Population for instance, what we need to do is:

• Retrieve all the Populations in our Dataset
• Retrieve the minimum and the maximum
• Transform the Population to (Population – minimum) / (maximum – minimum)

Here is what I came up with:

let column (dataset: float [][]) i =
dataset |> Array.map (fun row -> row.[i])

let columns (dataset: float [][]) =
let cols = dataset.[0] |> Array.length
[| for i in 0 .. (cols - 1) -> column dataset i |]

let minMax dataset =
dataset
|> columns
|> Array.map (fun col -> Array.min(col), Array.max(col))

let minMaxNormalizer dataset =
let bounds = minMax dataset
fun (vector: float[]) ->
Array.mapi (fun i v ->
(vector.[i] - fst v) / (snd v - fst v)) bounds

Compared to the Python example, I have to pay a bit of a code tax here, because I chose to use plain F# arrays to model my dataset, instead of using matrices. The column function extracts column i from a dataset, by mapping each row (an observation) to its ith component. columns expands on it, and essentially transposes the matrix, converting it from an array of row vectors to an array of column vectors.

More...

29. July 2012 12:43

With all sorts of people waving around the term “Machine Learning” lately, I figured it was time to look into what the fuss was about, so I purchased “Machine Learning In Action”. I am mostly enjoying the book so far, with one inconvenience: all the code presented is in Python, which is easy enough to follow, but not directly useful to me. The best way to learn is to get your hands dirty and code, so I am planning on converting the Python examples into F# as a progress through – which should also be a good exercise in learning more F#.

Chapter 2 of the book covers classification using k-Nearest Neighbors. The idea behind the algorithm is fairly straightforward: given a dataset of numeric observations, each observation being classified in a group, the algorithm will classify a new observation based on what group most of its close neighbors belong to.

The book uses a linear algebra library in its implementation. It seemed like overkill for the situation, so I’ll go for raw F# here.

Let’s first create a new F# library project, and start working on a script, creating a fictional dataset like this:

let createDataSet =
[| [| 1.0; 0.9 |]
[| 0.8; 1.0 |]
[| 0.8; 0.9 |]
[| 0.0; 0.1 |]
[| 0.3; 0.0 |]
[| 0.1; 0.1 |] |],
[| "A"; "A"; "A"; "B"; "B"; "B" |]

createDataSet returns a Tuple with two elements. First, we create an Array of Arrays, an Array containing 6 observations on 2 fictional variables. The second element is also an Array, containing the Label of the group the observation belongs to. For instance, the first observation was [ 1.0; 0.9 ], and it belonged to group A.

It would be helpful to visualize the dataset, to get a sense for the structure of the data. One option to do this is FSharpChart, a lightweight charting library which works fairly well with the F# interactive window. The easiest way to use it is by adding it to our project via NuGet, which adds a reference to MSDN.FSharpChart. We need to add a reference to FSharpChart to the script, with a reference to the path where NuGet downloaded the libraries (this post by Don Syme provides a great example) – we are now ready to add a scatterplot function to the script:

#r @"C:\Users\Mathias\Documents\Visual Studio 2010\Projects\MachineLearningInAction\packages\MSDN.FSharpChart.dll.0.60\lib\MSDN.FSharpChart.dll"

open MSDN.FSharp.Charting

let createDataSet =
[| [| 1.0; 0.9 |]
[| 0.8; 1.0 |]
[| 0.8; 0.9 |]
[| 0.0; 0.1 |]
[| 0.3; 0.0 |]
[| 0.1; 0.1 |] |],
[| "A"; "A"; "A"; "B"; "B"; "B" |]

let scatterplot (dataset: float[][]) =
dataset
|> Array.map (fun e -> e.[0], e.[1])
|> FSharpChart.FastPoint
|> FSharpChart.Create

The scatterplot simply takes a dataset, maps each observation to a tuple of X and Y coordinates, and passes it to FSharpChart.FastPoint, which produces a… scatterplot. Let’s select all that code, send it to F# interactive, and start playing in fsi:

> let data, labels = createDataSet
scatterplot data;;

val labels : string [] = [|"A"; "A"; "A"; "B"; "B"; "B"|]
val data : float [] [] =
[|[|1.0; 0.9|]; [|0.8; 1.0|]; [|0.8; 0.9|]; [|0.0; 0.1|]; [|0.3; 0.0|];
[|0.1; 0.1|]|]

At that point, you should see a chart popping up, looking like this one:

More...

27. June 2012 15:06

This months’ issue of MSDN Magazine has an interesting piece on evolutionary algorithms. The article applies a genetic algorithm to identify the minimum value of a “pathological” continuous function, the Schwefel function.

The Schwefel function

For X and Y values between –500 and 500, the “correct” answer is X and Y = 420.9687.

This function is designed to give fits to optimization algorithms. The issue here is that  the function has numerous peaks and valleys. As a result, if the search strategy is to use some form of gradient approach, that is, from a starting point, follow the steepest descent until a minimum is reached, there is a big risk to end up in a place which is a local minimum: the algorithm gets stuck in a valley with no path downwards, but there are other, better solutions around, which can be reached only by “climbing out of the hole” and temporarily following a path which heads in the wrong direction.

Out of curiosity, I checked how the Excel Solver would fare on this problem:

The result was an abject failure – not even close to the true solution.

I thought it would be interesting to see how Bumblebee, my Artificial Bee Colony framework, would perform. There are some general similarities between the underlying ideas behind the articles’ algorithm and Bumblebee, the main difference being that Bumblebee simply mutates individual solutions, and doesn’t create “crossover solutions”.

Let’s open Visual Studio, create an F# Console project, grab the NuGet package for Bumblebee – and start coding.

As usual, we need 4 elements to leverage Bumblebee – a Type of Solution, and 3 functions: a Generator, which returns a brand-new, random solution, a Mutator, which transforms a known solution into a new, similar solution, and an Evaluator, which evaluates a solution and returns a float, increasing with the quality of the solution.

In this case, the Solution type is fairly straightforward. We are looking for 2 floats x and y, so we’ll go for a Tuple. Similarly, the Evaluation is a given, it is the negative of the Schwefel function. The reason we go for the negative is that Bumblebee will try to maximize the Evaluation, so if we are looking for a minimum, we need to reverse the sign – because the Minimum of a function is the Maximum of its negative.

let schwefel x =
-x * Math.Sin(Math.Sqrt(Math.Abs(x)))

let evaluate (x, y) =
- schwefel (x) - schwefel (y)

The Generation function is also trivial – we’ll simply pick random pairs of floats in the [ –500.0; 500.0 ] interval:

let min = -500.0
let max = 500.0

let generate (rng: Random) = (
rng.NextDouble() * (max - min) + min,
rng.NextDouble() * (max - min) + min)

The Mutation function takes a tiny bit more of effort. The idea I followed is essentially the same as the one used in the article: given a solution defined by a pair of floats, randomly decide if any of the elements will be mutated, and if yes, add a random permutation to that element, scaled to the precision we want to achieve:

let precision = 0.00001
let rate = 0.5

let mutate (rng: Random) solution =
let (x, y) = solution
let x =
if rng.NextDouble() < rate
then x + precision * ((max - min) * rng.NextDouble() + min)
else x
let y =
if rng.NextDouble() < rate
then y + precision * ((max - min) * rng.NextDouble() + min)
else y
(x, y)

More...

#### Need help with F#?

The premier team for
F# training & consulting.