Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
18. August 2012 03:39

This is the continuation of my series exploring Machine Learning, converting the code samples of “Machine Learning in Action” from Python to F# as I go through the book. Today’s post covers Chapter 4, which is dedicated to Naïve Bayes classification – and you can find the resulting code on GitHub.

Disclaimer: I am new to Machine Learning, and claim no expertise on the topic. I am currently reading“Machine Learning in Action”, and thought it would be a good learning exercise to convert the book’s samples from Python to F#.

## The idea behind the Algorithm

The canonical application of Bayes naïve classification is in text classification, where the goal is to identify to which pre-determined category a piece of text belongs to  – for instance, is this email I just received spam, or ham (“valuable” email)?

The underlying idea is to use individual words present in the text as indications for what category it is most likely to belong to, using Bayes Theorem, named after the cheerful-looking Reverend Bayes.

Imagine that you received an email containing the words “Nigeria”, “Prince”, “Diamonds” and “Money”. It is very likely that if you look into your spam folder, you’ll find quite a few emails containing these words, whereas, unless you are in the business of importing diamonds from Nigeria and have some aristocratic family, your “normal” emails would rarely contain these words. They have a much higher frequency within the category “Spam” than within the Ham, which makes them a potential flag for undesired business ventures.

On the other hand, let’s assume that you are a lucky person, and that typically, what you receive is Ham, with the occasional Spam bit. If you took a random email in your inbox, it is then much more likely that it belongs to the Ham category.

Bayes’ Theorem combines these two pieces of information together, to determine the probability that a particular email belongs to the “Spam” category, if it contains the word “Nigeria”:

P(is “Spam”|contains ”Nigeria”) = P(contains “Nigeria|is ”Spam”) x P(is “Spam”) / P(contains “Nigeria”)

In other words, 2 factors should be taken into account when deciding whether an email containing “Nigeria” is spam: how over-represented is that word in Spam, and how likely is it that any email is spammy in the first place?

The algorithm is named “Naïve”, because it makes a simplifying assumption about the text, which turns out to be very convenient for computations purposes, namely that each word appears with a frequency which doesn’t depend on other words. This is an unlikely assumption (the word “Diamond” is much more likely to be present in an email containing “Nigeria” than in your typical family-members discussion email).

We’ll leave it at that on the concepts –  I’ll refer the reader who want to dig deeper to the book, or to this explanation of text classification with Naïve Bayes.

## A simple F# implementation

For my first pass, I took a slightly different direction from the book, and decided to favor readability over performance. I assume that we are operating on a dataset organized as a sequence of text samples, each of them labeled by category, along these lines (example from the book “Machine Learning in Action”):

Note: the code presented here can be found found on GitHub

let dataset =
[| ("Ham",  "My dog has flea problems help please");
("Spam", "Maybe not take him to dog park stupid");
("Ham",  "My dalmatian is so cute I love him");
("Spam", "Stop posting stupid worthless garbage");
("Ham",  "Mr Licks ate my steak how to stop him");
("Spam", "Quit buying worthless dog food stupid") |]

We will need to do some word counting to compute frequencies, so let’s start with a few utility functions:

    open System
open System.Text.RegularExpressions

// Regular Expression matching full words, case insensitive.
let matchWords = new Regex(@"\w+", RegexOptions.IgnoreCase)

// Extract and count words from a string.
// http://stackoverflow.com/a/2159085/114519
let wordsCount text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.groupBy (fun m -> m.Value)
|> Seq.map (fun (value, groups) ->
value.ToLower(), (groups |> Seq.length))

// Extracts all words used in a string.
let vocabulary text =
matchWords.Matches(text)
|> Seq.cast<Match>
|> Seq.map (fun m -> m.Value.ToLower())
|> Seq.distinct

// Extracts all words used in a dataset;
// a Dataset is a sequence of "samples",
// each sample has a label (the class), and text.
let extractWords dataset =
dataset
|> Seq.map (fun sample -> vocabulary (snd sample))
|> Seq.concat
|> Seq.distinct

// "Tokenize" the dataset: break each text sample
// into words and how many times they are used.
let prepare dataset =
dataset
|> Seq.map (fun (label, sample) -> (label, wordsCount sample))

We use a Regular Expression, \w+, to match all words, in a case-insensitive way. wordCount extracts individual words and the number of times they occur, while vocabulary simply returns the words encountered. The prepare function takes a complete dataset, and transforms each text sample into a Tuple containing the original classification label, and a Sequence of Tuples containing all lower-cased words found and their count.

More...

21. August 2010 16:38

In a previous post, we saw how to programmatically search for text in a PowerPoint slide, by iterating over the Shapes contained in a slide, finding the ones that have a TextFrame, and accessing their TextRange property. TextRange exposes a Text property, which “represents the text contained in the specified object”.

Our goal is to translate a slide from a language to another, which means translating every chunk of text we find. However, the Text property contains a bit more than just text. Suppose you were working with a slide like the one below, which contains multiple bullet points, with various indentations:

If you inspect the Text for the content area, you’ll see that it looks like this:

Work It\rMake It\rDo It\rMakes Us\rHarder\rBetter\rFaster\rStronger

At the end of each bullet point, we have a \r, which indicates a line break. If we want to maintain the formatting of our slide when we translate it, we’ll have to deal with it.

We’ll worry about the actual  translation later – for the moment we will use a fake method, which will show us what chunk of text has been translated:

public static string Translate(string text)
{
return "Translated [" + text + "]";
}

## A crude approach

A first approach would be to simply take the entire Text we find in the TextRange, manually separate it into chunks by splitting it around the carriage return character, translating the chunk, and re-composing the text, re-inserting the carriage returns.

Starting where we left off last time, let’s loop over the Shapes in the slide:

private void TranslateSlide()
{
var presentation = powerpoint.ActivePresentation;
var slide = (PowerPoint.Slide)powerpoint.ActiveWindow.View.Slide;
foreach (PowerPoint.Shape shape in slide.Shapes)
{
if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
{
var textFrame = shape.TextFrame;
var textRange = textFrame.TextRange;
var text = textRange.Text;
textRange.Text = CrudeApproach(text);
}
}
}

More...

8. August 2010 17:30

I am currently on a project which involves creating a PowerPoint VSTO add-in. I have very limited experience with PowerPoint automation, so before committing to the project, I thought it would be a good idea to explore a bit the object model, to gauge how difficult things could get, and I set to write a small PowerPoint add-in which would automatically translate slides. Sounds like a simple enough project, how difficult could it be?

Turns out, not too difficult, but not completely trivial either. I discovered quickly that the PowerPoint object model, unlike most Office applications, doesn’t have much (any?) documentation for the .Net developer; the best I found is the VBA PowerPoint 2007 developer reference, which gives a decent starting point to figure out what the objects are about. So I thought I would share my exploration of the PowerPoint jungle, and hopefully spare some trouble to other .Net developers.

## The plan

The objective is simple: write an add-in which allows the user to

• select a language to translate from, and a language to translate to,
• create a duplicate of the current slide, translating all the text and keeping the layout

The plan will be to use Google Translate to perform the translation. In order to do that, we will nedd to extract out all pieces of text that require translating.

## Finding all the text in a slide

Lets’ start by identifying where we have text in the current slide. Let’s first create a PowerPoint 2007 Add-in project in Visual Studio. To keep things simple for now, we will add a Ribbon control with a button, and when that button is clicked, we’ll start working on the current slide:

Double-click on the Button (I renamed my button translateButton) to generate an event handler for the Click event, and get the current Slide:

private void translateButton_Click(object sender, RibbonControlEventArgs e)
{
if (powerpoint.ActivePresentation.Slides.Count > 0)
{
var slide = (PowerPoint.Slide)powerpoint.ActiveWindow.View.Slide as PowerPoint.Slide;
}
}

More...

#### Need help with F#?

The premier team for
F# training & consulting.