Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
12. April 2014 11:05

A lightweight post this week. One of my favorite F# type providers is the World Bank type provider, which enables ridiculously easy access to a boatload of socio-economic data for every country in the world. However, numbers are cold – wouldn’t it be nice to visualize them using a map? Turns out it’s pretty easy to do, using another of my favorites, the R type provider. The rworldmap R package, as its name suggests, is all about world maps, and is a perfect fit with the World Bank data.

The video below shows you the results in action; I also added the code below, for good measure. The only caveat relates to the integration between the Deedle data frame library and R. I had to manually copy the Deedle.dll and Deedle.RProvider.Plugin.dll into packages\RProvider.1.0.5\lib for the R Provider to properly convert Deedle data frames into R data frames. Enjoy!

Here is the script I used:

#I @"..\packages\"
#r @"R.NET.1.5.5\lib\net40\RDotNet.dll"
#r @"RProvider.1.0.5\lib\RProvider.dll"
#r @"FSharp.Data.2.0.5\lib\net40\FSharp.Data.dll"
#r @"Deedle.0.9.12\lib\net40\Deedle.dll"
#r @"Deedle.RPlugin.0.9.12\lib\net40\Deedle.RProvider.Plugin.dll"

open FSharp.Data
open RProvider
open RProvider.base
open Deedle
open Deedle.RPlugin
open RProviderConverters

let wb = WorldBankData.GetDataContext()
wb.Countries.France.CapitalCity
wb.Countries.France.Indicators.Population (Total).[2000]

let countries = wb.Countries

let pop2000 = series [ for c in countries -> c.Code => c.Indicators.Population (Total).[2000]]
let pop2010 = series [ for c in countries -> c.Code => c.Indicators.Population (Total).[2010]]
let surface = series [ for c in countries -> c.Code => c.Indicators.Surface area (sq. km).[2010]]

let df = frame [ "Pop2000" => pop2000; "Pop2010" => pop2010; "Surface" => surface ]
df?Codes <- df.RowKeys

open RProvider.rworldmap

let map = R.joinCountryData2Map(df,"ISO3","Codes")
R.mapCountryData(map,"Pop2000")

df?Density <- df?Pop2010 / df?Surface
df?Growth <- (df?Pop2010 - df?Pop2000) / df?Pop2000

let map2 = R.joinCountryData2Map(df,"ISO3","Codes")
R.mapCountryData(map2,"Density")
R.mapCountryData(map2,"Growth")

Have a great week-end, everybody! And big thanks to Tomas for helping me figure out a couple of things about Deedle.

22. March 2014 13:11

A couple of days ago, I got into the following Twitter exchange:

So why do I think FsCheck + XUnit = The Bomb?

I have a long history with Test-Driven Development; to this day, I consider Kent Beck’s “Test-Driven Development by Example” one of the biggest influences in the way I write code (any terrible code I might have written is, of course, to be blamed entirely on me, and not on the book).

In classic TDD style, you typically proceed by writing incremental test cases which match your requirements, and progressively write the code that will satisfy the requirements. Let’s illustrate on an example, a password strength validator. Suppose that my requirements are “a password must be at least 8 characters long to be valid”. Using XUnit, I would probably write something along these lines:

namespace FSharpTests

open Xunit
open CSharpCode

module Password validator tests =

[<Fact>]
let length above 8 should be valid () =
let validator = Validator ()


… and in the CSharpCode project, I would then write the dumbest minimal implementation that could passes that requirement, that is:

public class Validator
{
{
return true;
}
}


Next, I would write a second test, to verify the obvious negative:

namespace FSharpTests

open Xunit
open CSharpCode

module Password validator tests =

[<Fact>]
let length above 8 should be valid () =
let validator = Validator ()

[<Fact>]
let length under 8 should not be valid () =
let validator = Validator ()


This fails, producing the following output in Visual Studio:

… which forces me to fix my implementation, for instance like this:

public class Validator
{
{
{
return false;
}

return true;
}
}


Let’s pause here for a couple of remarks. First, note that while my tests are written in F#, the code base I am testing against is in C#. Mixing the two languages in one solution is a non-issue. Then, after years of writing C# test cases with names like Length_Above_8 _Should_Be_Valid, and arguing whether this was better or worse than LengthAbove8 ShouldBeValid, I find that having the ability to simply write “length above 8 should be valid”, in plain old English (and seeing my tests show that way in the test runner as well), is pleasantly refreshing. For that reason alone, I would encourage F#-curious C# developers to try out writing tests in F#; it’s a nice way to get your toes in the water, and has neat advantages.

But that’s not the main point I am interested here. While this process works, it is not without issues. From a single requirement, “a password must be at least 8 characters long to be valid”, we ended up writing 2 test cases. First, the cases we ended up are somewhat arbitrary, and don’t fully reflect what they say. I only tested two instances, one 7 characters long, one 8 characters long. This is really relying on my ability as a developer to identify “interesting cases” in a vast universe of possible passwords, hoping that I happened to cover sufficient ground.

This is where FsCheck comes in. FsCheck is a port of Haskell’s QuickCheck, a property-based testing framework. The term “property” is somewhat overloaded, so let’s clarify: what “Property” means in that context is a property of our program that should be true, in the same sense as mathematically, a property of any number x is “x * x is positive”. It should always be true, for any input x.

Install FsCheck via Nuget, as well as the FsCheck XUnit extension; you can now write tests that verify properties by marking them with the attribute [<Property>], instead of [<Fact>], and the XUnit test runner will pick them up as normal tests. For instance, taking our example from right above, we can write:

namespace FSharpTests

open Xunit
open FsCheck
open FsCheck.Xunit
open CSharpCode

module Specification =

[<Property>]
let square should be positive (x:float) =
x * x > 0.


Let’s run that – fail. If you click on the test results, here is what you’ll see:

FsCheck found a counter-example, 0.0. Ooops! Our specification is incorrect here, the square value doesn’t have to be strictly positive, and could be zero. This is an obvious mistake, let’s fix the test, and get on with our lives:

[<Property>]
let square should be positive (x:float) =
x * x >= 0.


More...

1. March 2014 14:32

During some recent meanderings through the confines of the internet, I ended up discovering the Winnow Algorithm. The simplicity of the approach intrigued me, so I thought it would be interesting to try and implement it in F# and see how well it worked.

The purpose of the algorithm is to train a binary classifier, based on binary features. In other words, the goal is to predict one of two states, using a collection of features which are all binary. The prediction model assigns weights to each feature; to predict the state of an observation, it checks all the features that are “active” (true), and sums up the weights assigned to these features. If the total is above a certain threshold, the result is true, otherwise it’s false. Dead simple – and so is the corresponding F# code:

type Observation = bool []
type Label = bool
type Example = Label * Observation
type Weights = float []

let predict (theta:float) (w:Weights) (obs:Observation) =
(obs,w) ||> Seq.zip
|> Seq.filter fst
|> Seq.sumBy snd
|> ((<) theta)

We create some type aliases for convenience, and write a predict function which takes in theta (the threshold), weights and and observation; we zip together the features and the weights, exclude the pairs where the feature is not active, sum the weights, check whether the threshold is lower that the total, and we are done.

In a nutshell, the learning process feeds examples (observations with known label), and progressively updates the weights when the model makes mistakes. If the current model predicts the output correctly, don’t change anything. If it predicts true but should predict false, it is over-shooting, so weights that were used in the prediction (i.e. the weights attached to active features) are reduced. Conversely, if the prediction is false but the correct result should be true, the active features are not used enough to reach the threshold, so they should be bumped up.

And that’s pretty much it – the algorithm starts with arbitrary initial weights of 1 for every feature, and either doubles or halves them based on the mistakes. Again, the F# implementation is completely straightforward. The weights update can be written as follows:

let update (theta:float) (alpha:float) (w:Weights) (ex:Example) =
let real,obs = ex
match (real,predict theta w obs) with
| (true,false) -> w |> Array.mapi (fun i x -> if obs.[i] then alpha * x else x)
| (false,true) -> w |> Array.mapi (fun i x -> if obs.[i] then x / alpha else x)
| _ -> w

Let’s check that the update mechanism works:

> update 0.5 2. [|1.;1.;|] (false,[|false;true;|]);;
val it : float [] = [|1.0; 0.5|]

The threshold is 0.5, the adjustment multiplier is 2, and each feature is currently weighted at 1. The state of our example is [| false; true; |], so only the second feature is active, which means that the predicted value will be 1. (the weight of that feature). This is above the threshold 0.5, so the predicted value is true. However, because the correct value attached to that example is false, our prediction is incorrect, and the weight of the second feature is reduced, while the first one, which was not active, remains unchanged.

Let’s wrap this up in a convenience function which will learn from a sequence of examples, and give us directly a function that will classify observations:

let learn (theta:float) (alpha:float) (fs:int) (xs:Example seq) =
let updater = update theta alpha
let w0 = [| for f in 1 .. fs -> 1. |]
let w = Seq.fold (fun w x -> updater w x) w0 xs
fun (obs:Observation) -> predict theta w obs

We pass in the number of features, fs, to initialize the weights at the correct size, and use a fold to update the weights for each example in the sequence. Finally, we create and return a function that, given an observation, will predict the label, based on the weights we just learnt.

And that’s it – in 20 lines of code, we are done, the Winnow is implemented.

More...

15. February 2014 12:51

My favorite column in MSDN Magazine is Test Run; it was originally focused on testing, but the author, James McCaffrey, has been focusing lately on topics revolving around numeric optimization and machine learning, presenting a variety of methods and approaches. I quite enjoy his work, with one minor gripe –his examples are all coded in C#, which in my opinion is really too bad, because the algorithms would gain much clarity if written in F# instead.

Back in June 2013, he published a piece on Amoeba Method Optimization using C#. I hadn’t seen that approach before, and found it intriguing. I also found the C# code a bit too hairy for my feeble brain to follow, so I decided to rewrite it in F#.

In a nutshell, the Amoeba approach is a heuristic to find the minimum of a function. Its proper respectable name is the Nelder-Nead method. The reason it is also called the Amoeba method is because of the way the algorithm works: in its simple form, it starts from a triangle, the “Amoeba”; at each step, the Amoeba “probes” the value of 3 points in its neighborhood, and moves based on how much better the new points are. As a result, the triangle is iteratively updated, and behaves a bit like an Amoeba moving on a surface.

Before going into the actual details of the algorithm, here is how my final result looks like. You can find the entire code here on GitHub, with some usage examples in the Sample.fsx script file. Let’s demo the code in action: in a script file, we load the Amoeba code, and use the same function the article does, the Rosenbrock function. We transform the function a bit, so that it takes a Point (an alias for an Array of floats, essentially a vector) as an input, and pass it to the solve function, with the domain where we want to search, in that case, [ –10.0; 10.0 ] for both x and y:

#load "Amoeba.fs"

open Amoeba
open Amoeba.Solver

let g (x:float) y =
100. * pown (y - x * x) 2 + pown (1. - x) 2

let testFunction (x:Point) =
g x.[0] x.[1]

solve Default [| (-10.,10.); (-10.,10.) |] testFunction 1000

Running this in the F# interactive window should produce the following:

val it : Solution = (0.0, [|1.0; 1.0|])
>

The algorithm properly identified that the minimum is 0, for a value of x = 1.0 and y = 1.0. Note that results may vary: this is a heuristic, which starts with a random initial amoeba, so each run could produce slightly different results, and might at times epically fail.

More...

2. February 2014 08:08

tl/dr: Community for F# has a brand-new page at www.c4fsharp.net – with links to a ton of recorded F# presentations, as well as F# hands-on Dojos and material. Check it out, and let us know on Twitter what you think, and what you want us to do next… and spread the word!

If you are into F# and don’t know Community for F#, you are missing out! Community for F#, aka C4FSharp, is the brainchild of Ryan Riley. Ryan has been running C4FSharp tirelessly for years, making great content available online for the F# community.

The idea of C4FSharp is particularly appealing to me, because in my opinion, it serves a very important role. The F# community is amazingly active and friendly, but has an interesting challenge: it is highly geographically dispersed. As a result, it is often difficult to attend presentations locally, or, if you organize Meetups, to find speakers.

Ryan has been doing a phenomenal job addressing that issue, by regularly organizing online live presentations, and making them available offline as well, so that no matter where you are, you can access all that great content. The most visible result is an amazing treasure trove of F# videos on Vimeo, going back all the way to 2010. While I am giving credit where credit is due, special hats off to Rick Minerich, who has been recording the NYC meetings since forever, and making them available on Vimeo as well – and also has been lending a helping hand when C4FSharp needed assistance. Long story short, Rick is just an all-around fantastic guy, so… thanks, Rick!

In any case, the question of how to help grow the F# community has been on my mind quite a bit recently, so I was very excited when Ryan accepted to try working on this as a team, and put our ideas together. The direction I am particularly interested in is to provide support for local groups to grow. Online is great, but nothing comes close to having a good old fashioned meeting with like-minded friends to discuss and learn. So one thing I would like to see happen is for C4FSharp to become a place where you can find resources to help you guys start and run your own local group. While running a Meetup group does take a bit of effort, it’s not nearly as complicated as what people think it is, and it is very fun and rewarding. So if you want to see F# meetings in your area, just start a Meetup group!

In that frame, Ryan has put together a brand-new web page at www.c4fsharp.net, where we started listing resources. The existing videos, of course, but also a repository of hands-on Dojos and presentation/workshop material. The hands-on Dojos is something we started doing in San Francisco last year with www.sfsharp.org, and  has been working really well. Instead of a classic presentation, the idea is to create a fun coding problem, sit down in groups and work on it, learn from each other, and share. It’s extremely fun, and, from a practical standpoint, it’s also very convenient, because you don’t need to fly in a speaker to present. Just grab the repository from GitHub, look at the instructions, and run with it!

Just to whet your appetite, here is a small selection of the amazing images that came out of running the Fractal Forest Dojo in Nashville and San Francisco this month: