Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 10. February 2013 11:25

And the Journey converting “Machine Learning in Action” from Python to F# continues! Rather than following the order of the book, I decided to skip chapters 8 and 9, dedicated to regression methods (regression is something I spent a bit too much time doing in the past to be excited about it just right now), and go straight to Unsupervised Learning, which begins with the K-means clustering algorithm.

So what is clustering about?

In a nutshell, clustering focuses on the following question: given a set of observations, can the computer figure out a way to classify them into “meaningful groups”? The major difference with Classification methods is that in clustering, the Categories / Groups are initially unknown: it’s the algorithm’s job to figure out sensible ways to group items into Clusters, all by itself (hence the word “unsupervised”).

Chapter 10 covers two clustering algorithms, k-means and bisecting k-means. We’ll discuss only the first one today.

The underlying idea behind the k-means algorithm is to identify k “representative archetypes” (k being a user input), the Centroids. The algorithm proceeds iteratively:

  • Starting from k random Centroids,
  • Observations are assigned to the closest Centroid, and constitute a Cluster,
  • Centroids are updated, by taking the average of their Cluster,
  • Until the allocation of Observations to Clusters doesn’t change any more.

When things go well, we end up with k stable Centroids (further updates no longer change the Clusters), and Clusters contain Observations that are similar, because they are all close to the same Centroid (the Wikipedia page for the algorithm provides a nice graphical representation).

F# implementation

The Python implementation proposed in the book is both very procedural and deals with Observations that are vectors. I thought it would be interesting to take a different approach, focused on functions instead. The current implementation is likely to change when I get into bisecting k-means, but should remain similar in spirit. Note also that I have given no focus to performance – this is my take on the easiest thing that would work.

The entire code can be found here on GitHub.

Here is how I approached the problem. First, rather than restricting ourselves to vectors, suppose we want to deal with any generic type. Looking at the pseudo-code above, we need a few functions to implement the algorithm:

  • to assign Observations of type ‘a to the closest Centroid ‘a, we need a notion of Distance,
  • we need to create an initial collection of k Centroids of type ‘a, given a dataset of ‘as,
  • to update the Centroids based on a Cluster of ‘as, we need some aggregation function.

Let’s create these 3 functions:

    // the Distance between 2 observations 'a is a float
    // It also better be positive - left to the implementer
    type Distance<'a> = 'a -> 'a -> float
    // CentroidsFactory, given a dataset, 
    // should generate n Centroids
    type CentroidsFactory<'a> = 'a seq -> int -> 'a seq
    // Given a Centroid and observations in a Cluster,
    // create an updated Centroid
    type ToCentroid<'a> = 'a -> 'a seq -> 'a

We can now define a function which, given a set of Centroids, will return the index of the closest Centroid to an Observation, as well as the distance from the Centroid to the Observation:

    // Returns the index of and distance to the 
    // Centroid closest to observation
    let closest (dist: Distance<'a>) centroids (obs: 'a) =
        centroids
        |> Seq.mapi (fun i c -> (i, dist c obs)) 
        |> Seq.minBy (fun (i, d) -> d)
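As a quick illustration (assuming the definitions above are in scope), here is closest at work with a made-up one-dimensional distance:

```fsharp
// A simple 1-dimensional distance, for illustration only
let d (x: float) y = abs (x - y)
// 12.0 is closest to the Centroid at index 1, at distance 2.0
closest d [ 0.; 10.; 20. ] 12.0 // (1, 2.0)
```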

Finally, we’ll go for the laziest possible way to generate k initial Centroids, by picking up k random observations from our dataset:

    // Picks k random observations as initial centroids
    // (this is very lazy, even tolerates duplicates)
    let randomCentroids<'a> (rng: System.Random) 
                            (sample: 'a seq) 
                            k =
        let size = Seq.length sample
        seq { for i in 1 .. k do 
              let pick = Seq.nth (rng.Next(size)) sample
              yield pick }
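A hypothetical usage, picking 2 starting Centroids from a small sample:

```fsharp
let rng = System.Random(42)
let starters =
    randomCentroids rng [ 1.; 2.; 3.; 4.; 5. ] 2
    |> Seq.toList
// starters contains 2 values drawn from the sample
// (possibly the same one twice)
```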

We have all we need – we can now write the algorithm itself:

// Given a distance, centroid factory and
// centroid aggregation function, identify
// the k centroids of a dataset
let kmeans (dist: Distance<'a>) 
           (factory: CentroidsFactory<'a>) 
           (aggregator: ToCentroid<'a>)
           (dataset: 'a seq) 
           k =
    // Recursively update Centroids and
    // the assignment of observations to Centroids
    let rec update (centroids, assignment) =
        // Assign each point to the closest centroid
        let next = 
            dataset 
            |> Seq.map (fun obs -> closest dist centroids obs)
            |> Seq.toList
        // Check if any assignment changed
        let change =
            match assignment with
            | Some(previous) -> 
                Seq.zip previous next    
                |> Seq.exists (fun ((i, _), (j, _)) -> not (i = j))
            | None -> true // initially we have no assignment
        if change 
        then 
            // Update each Centroid position:
            // extract cluster of points assigned to each Centroid
            // and compute the new Centroid by aggregating cluster
            let updatedCentroids =
                let assignedDataset = Seq.zip dataset next
                centroids 
                |> Seq.mapi (fun i centroid -> 
                    assignedDataset 
                    |> Seq.filter (fun (_, (ci, _)) -> ci = i)
                    |> Seq.map (fun (obs, _) -> obs)
                    |> aggregator centroid)
            // Perform another round of updates
            update (updatedCentroids, Some(next))
        // No assignment changed, we are done
        else (centroids, next)

    let initialCentroids = factory dataset k
    let centroids = update (initialCentroids, None) |> fst |> Seq.toList        
    let classifier = fun datapoint -> 
        centroids 
        |> List.minBy (fun centroid -> dist centroid datapoint)
    centroids, classifier

The meat of the algorithm is the update function. It takes in a set of current Centroids, and an optional Assignment of Observations to Centroids, represented as a list mapping each Observation to a Centroid index and the corresponding distance. Note that we could drop the distance from the assignment – it’s never used afterwards; I added it prematurely because it is needed in the bisecting k-means algorithm.

The update function is recursive – it computes what Centroid / Cluster each Observation will be assigned to next, then checks whether any Observation has been assigned to a different Cluster than before (or whether there is an assignment at all, to cover the initial case when no assignment has been computed yet). If a change occurred, new Centroids are computed and we go for another round; otherwise, we are done.

The outer function calls update, and once it terminates, returns the Centroids that have been identified, as well as a Classifier function, which will return the closest Centroid to an Observation.

The algorithm in action

I created two small examples illustrating the algorithm in action: one classic, with numeric observations, and one “just for kicks”, attempting to cluster a collection of strings. Both can be found in the file Chapter10.fsx.

The classic case operates on an artificially created dataset: we pick 3 points in 3 dimensions, and generate 50 points randomly scattered around each of them:

let rng = new System.Random()
let centroids = [ [| 0.; 0.; 0. |]; [| 20.; 30.; 40. |]; [| -40.; -50.; -60. |] ]
// Create 50 points centered around each Centroid
let data = [ 
    for centroid in centroids do
        for i in 1 .. 50 -> 
            Array.map (fun x -> x + 5. * (rng.NextDouble() - 0.5)) centroid ]

If everything works correctly, we expect the algorithm to identify 3 Centroids close to the 3 points we used as anchor points for our data sample. We need to define 2 functions, which are included in the main module: a Distance, and a function to compute a Centroid from a Cluster of Observations:

// Euclidean distance between 2 points, represented as float []
let euclidean x y = 
    Array.fold2 (fun d e1 e2 -> d + pown (e1 - e2) 2) 0. x y 
    |> sqrt

// Recompute Centroid as average of given sample
let avgCentroid (current: float []) (sample: float [] seq) =
    let size = Seq.length sample
    match size with
    | 0 -> current
    | _ ->
        sample
        |> Seq.reduce (fun v1 v2 -> 
               Array.map2 (fun v1x v2x -> v1x + v2x) v1 v2)
        |> Array.map (fun e -> e / float size)
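As a quick sanity check for these two functions:

```fsharp
// The classic 3-4-5 right triangle
euclidean [| 0.; 0. |] [| 3.; 4. |] // 5.0
// Averaging two points, component by component
avgCentroid [| 0.; 0. |] [ [| 1.; 2. |]; [| 3.; 4. |] ] // [| 2.; 3. |]
```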

Armed with this, we can run the algorithm:

let factory = randomCentroids<float[]> rng
let identifiedCentroids, classifier = kmeans euclidean factory avgCentroid data 3
printfn "Centroids identified"
identifiedCentroids 
|> List.iter (fun c -> 
    printfn ""
    printf "Centroid: "
    Array.iter (fun x -> printf "%.2f " x) c)

On my machine, this produces the following:

Centroids identified

Centroid: 19.93 30.32 39.89
Centroid: -39.98 -50.10 -59.69
Centroid: -0.28 0.43 -0.01

The 3 centroids are exactly what we expect – 3 points close to {20; 30; 40}, {-40; -50; -60} and {0; 0; 0}. Things seem to be working.

Now I was curious to see if this would be usable on something completely different, like strings. As usual, in order to make that work, we need a Distance, and a way to reduce a Cluster to a Centroid. The most obvious choice for a Distance between strings is the Levenshtein distance, which measures how many edits are required to transform a string into another. Fortunately for me, someone already provided an implementation in F#, which I shamelessly lifted.
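For reference, here is a minimal sketch of the Levenshtein distance (not the implementation linked below), using the classic dynamic-programming approach; to plug it into kmeans as a Distance, the int result simply needs to be converted to float:

```fsharp
// Minimal Levenshtein distance: the number of single-character
// insertions, deletions or substitutions to turn a into b
let levenshtein (a: string) (b: string) =
    let m, n = a.Length, b.Length
    // d.[i, j] = distance between the first i chars of a
    // and the first j chars of b
    let d = Array2D.create (m + 1) (n + 1) 0
    for i in 0 .. m do d.[i, 0] <- i
    for j in 0 .. n do d.[0, j] <- j
    for i in 1 .. m do
        for j in 1 .. n do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            d.[i, j] <- List.min [ d.[i - 1, j] + 1
                                   d.[i, j - 1] + 1
                                   d.[i - 1, j - 1] + cost ]
    d.[m, n]

// usable as a Distance<string>:
let wordDistance (a: string) (b: string) = float (levenshtein a b)
```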

The Centroid update question required a bit of thinking. Obviously, computing the average of strings isn’t going to work – so how could we find a good “representative string” from a Cluster? I decided to go for something fairly simple: pick the string in the Cluster which has the least worst-case distance to all the others (as an alternative, I also tried picking the string with the lowest sum of squares distance, which produced similar results).
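Here is a sketch of that aggregator (the medoid name is mine, and the signature matches the ToCentroid shape defined earlier):

```fsharp
// Pick the word in the cluster with the smallest worst-case
// distance to every other word; keep the current Centroid
// if the cluster happens to be empty
let medoid (dist: string -> string -> float)
           (current: string)
           (sample: string seq) =
    if Seq.isEmpty sample
    then current
    else
        sample
        |> Seq.minBy (fun w ->
            sample |> Seq.map (dist w) |> Seq.max)
```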

Finally, I created a sample, using a collection of 53 words sharing three different roots: “GRAPH”, “SCRIPT” and “GRAM”. Results vary from run to run (not surprisingly, the algorithm often struggles to separate GRAPH and GRAM words), but overall I was pleasantly surprised by the results:

Words identified
TELEGRAPHIC
RADIOGRAM
PRESCRIPTIVE


Classification of sample words
AUTOBIOGRAPHER -> TELEGRAPHIC
AUTOBIOGRAPHICAL -> TELEGRAPHIC
AUTOBIOGRAPHY -> TELEGRAPHIC
AUTOGRAPH -> RADIOGRAM
BIBLIOGRAPHIC -> TELEGRAPHIC
BIBLIOGRAPHY -> TELEGRAPHIC
CALLIGRAPHY -> TELEGRAPHIC
CARTOGRAPHY -> RADIOGRAM
CRYPTOGRAPHY -> RADIOGRAM
GRAPH -> TELEGRAPHIC
HISTORIOGRAPHY -> TELEGRAPHIC
PARAGRAPH -> TELEGRAPHIC
SEISMOGRAPH -> TELEGRAPHIC
STENOGRAPHER -> TELEGRAPHIC
TELEGRAPH -> TELEGRAPHIC
TELEGRAPHIC -> TELEGRAPHIC
BIBLIOGRAPHICAL -> TELEGRAPHIC
STEREOGRAPH -> TELEGRAPHIC
DESCRIBABLE -> PRESCRIPTIVE
DESCRIBE -> PRESCRIPTIVE
DESCRIBER -> PRESCRIPTIVE
DESCRIPTION -> PRESCRIPTIVE
DESCRIPTIVE -> PRESCRIPTIVE
INDESCRIBABLE -> PRESCRIPTIVE
INSCRIBE -> PRESCRIPTIVE
INSCRIPTION -> PRESCRIPTIVE
POSTSCRIPT -> PRESCRIPTIVE
PRESCRIBE -> PRESCRIPTIVE
PRESCRIPTION -> PRESCRIPTIVE
PRESCRIPTIVE -> PRESCRIPTIVE
SCRIBAL -> RADIOGRAM
SCRIBBLE -> PRESCRIPTIVE
SCRIBE -> PRESCRIPTIVE
SCRIBBLER -> RADIOGRAM
SCRIPT -> PRESCRIPTIVE
SCRIPTURE -> PRESCRIPTIVE
SCRIPTWRITER -> PRESCRIPTIVE
SUPERSCRIPT -> PRESCRIPTIVE
TRANSCRIBE -> PRESCRIPTIVE
TYPESCRIPT -> PRESCRIPTIVE
TRANSCRIPTION -> PRESCRIPTIVE
DESCRIPTOR -> PRESCRIPTIVE
ANAGRAM -> RADIOGRAM
CABLEGRAM -> RADIOGRAM
CRYPTOGRAM -> RADIOGRAM
GRAMMAR -> RADIOGRAM
GRAMMARIAN -> RADIOGRAM
GRAMMATICAL -> RADIOGRAM
MONOGRAM -> RADIOGRAM
RADIOGRAM -> RADIOGRAM
TELEGRAM -> TELEGRAPHIC
UNGRAMMATICAL -> TELEGRAPHIC
AEROGRAM -> RADIOGRAM

That’s it for today! In our next “ML in Action” episode, we’ll look into the bisecting k-means algorithm, which is a variation on today’s algorithm, and probably revisit the implementation. In the meanwhile, feel free to leave comments or feedback!

Resources

Source code on GitHub: the relevant code is in the files KMeansClustering.fs and Chapter10.fsx.

K-means algorithm on Wikipedia.

Levenshtein distance on Wikipedia, and an F# implementation of Levenshtein distance.

Interesting discussion on the Levenshtein distance on @Rickasaurus’ blog.

Another K-means implementation in F#, from @DevShorts.

Root Words: an intriguing web page, providing help to learn words and vocabulary, which contains a list of words roots. It has one incredibly annoying feature – you can’t copy paste text from the page.
