Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

File:Thomas Bayes.gif

The idea behind the Algorithm

The canonical application of Bayes naïve classification is in text classification, where the goal is to identify to which pre-determined category a piece of text belongs to  – for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain these words. They have a much higher frequency within the category “Spam” than within the Ham, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam”|contains ”Nigeria”) = P(contains “Nigeria|is ”Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?

The algorithm is named “Naïve”, because it makes a simplifying assumption about the text, which turns out to be very convenient for computations purposes, namely that each word appears with a frequency which doesn’t depend on other words. This is an unlikely assumption (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).

We’ll leave it at that on the concepts –  I’ll refer the reader who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found found on GitHub

let dataset =
    [| ("Ham",  "My dog has flea problems help please");
       ("Spam", "Maybe not take him to dog park stupid");
       ("Ham",  "My dalmatian is so cute I love him");
       ("Spam", "Stop posting stupid worthless garbage");
       ("Ham",  "Mr Licks ate my steak how to stop him");
       ("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
    open System.Text.RegularExpressions

    // Regular Expression matching full words, case insensitive.
    let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

    // Extract and count words from a string.
    // http://stackoverflow.com/a/2159085/114519        
    let wordsCount text =
        matchWords.Matches(text)
        |> Seq.cast<Match>
        |> Seq.groupBy (fun m -> m.Value)
        |> Seq.map (fun (value, groups) -> 
            value.ToLower(), (groups |> Seq.length))

    // Extracts all words used in a string.
    let vocabulary text =
        matchWords.Matches(text)
        |> Seq.cast<Match>
        |> Seq.map (fun m -> m.Value.ToLower())
        |> Seq.distinct

    // Extracts all words used in a dataset;
    // a Dataset is a sequence of "samples", 
    // each sample has a label (the class), and text.
    let extractWords dataset =
        dataset 
        |> Seq.map (fun sample -> vocabulary (snd sample))
        |> Seq.concat
        |> Seq.distinct

    // "Tokenize" the dataset: break each text sample
    // into words and how many times they are used.
    let prepare dataset =
        dataset
        |> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.

More...

Comments

Comment RSS