Mathias Brandewinder on .NET, F#, VSTO and Excel development, and quantitative analysis / machine learning.
by Mathias 26. May 2013 09:06

I got interested in the following question lately: given a data set of examples with some continuous-valued features and discrete classes, what’s a good way to reduce the continuous features into a set of discrete values?

What makes this question interesting? One very specific reason is that some machine learning algorithms, like Decision Trees, require discrete features. As a result, potentially informative data has to be discarded. For example, consider the Titanic dataset: we know the age of passengers of the Titanic, or how much they paid for their ticket. To use these features, we would need to reduce them to a set of states, like “Old/Young” or “Cheap/Medium/Expensive” – but how can we determine what states are appropriate, and what values separate them?

More generally, it’s easier to reason about a handful of cases than a continuous variable – and it’s also more convenient computationally to represent information as a finite set of states.

So how could we go about identifying a reasonable way to partition a continuous variable into a handful of informative, representative states?

In the context of a classification problem, what we are interested in is whether the states provide information with respect to the Classes we are trying to recognize. As far as I can tell from my cursory review of what’s out there, the main approaches use either Chi-Square tests or Entropy to achieve that goal. I’ll leave aside Chi-Square based approaches for today, and look into the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani in 1993.

The algorithm idea

The algorithm hinges on two key ideas:

  • Data should be split into intervals that maximize the information, measured by Entropy,
  • Partitioning should not be too fine-grained, to avoid over-fitting.

The first part is classic: given a data set, split it into two halves, based on whether the continuous value is above or below a “splitting value”, and compute the resulting gain in entropy. Out of all possible splitting values, take the one that generates the best gain – and repeat in a recursive fashion.

Let’s illustrate on an artificial example – our output can take 2 values, Yes or No, and we have one continuous-valued feature:

Continuous Feature | Output Class
1.0 | Yes
1.0 | Yes
2.0 | No
3.0 | Yes
3.0 | No

As is, the dataset has an Entropy of H = - 0.6 x Log (0.6) – 0.4 x Log (0.4) = 0.67 (5 examples, with 3/5 Yes, and 2/5 No).
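That number is easy to double-check. Here is a quick Python sketch of the entropy computation (the post’s own implementation, further down, is in F# – this is purely illustrative, and the function name `entropy` is mine):

```python
import math

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # sum of - p * log(p) over the observed class proportions
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# The 5-example dataset: 3 "Yes", 2 "No"
print(round(entropy(["Yes", "Yes", "No", "Yes", "No"]), 2))  # 0.67
```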

The Continuous Feature takes 3 values: 1.0, 2.0 and 3.0, which leaves us with 2 possible splits: strictly less than 2, or strictly less than 3. Suppose we split on 2.0 – we would get 2 groups. Group 1 contains Examples where the Feature is less than 2:

Continuous Feature | Output Class
1.0 | Yes
1.0 | Yes

The Entropy of Group 1 is H(g1) = - 1.0 x Log(1.0) = 0.0

Group 2 contains the rest of the examples:

Continuous Feature | Output Class
2.0 | No
3.0 | Yes
3.0 | No

The Entropy of Group 2 is H(g2) = - 0.33 x Log(0.33) – 0.67 x Log(0.67) = 0.63

Partitioning on 2.0 gives us a gain of H – 2/5 x H(g1) – 3/5 x H(g2) = 0.67 – 0.4 x 0.0 – 0.6 x 0.63 = 0.29. That split gives us additional information on the output, which seems intuitively correct, as one of the groups is now formed purely of “Yes”. In a similar fashion, we can compute the information gain of splitting around the other possible value, 3.0, which would give us a gain of 0.67 – 0.6 x 0.63 – 0.4 x 0.69 = 0.02: that split barely improves information, so we would use the first split (in general, out of all splits, we would take the one leading to the largest gain).
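The split-and-measure step can be verified numerically. Below is a small Python sketch (illustration only; `entropy` and `split_gain` are my names, not the post’s) that reproduces the gains of the two candidate splits:

```python
import math

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n)
                for c in set(labels))

def split_gain(data, value):
    """Entropy gained by splitting (value, class) pairs on `value`:
    strictly-below goes left, the rest goes right."""
    left = [c for v, c in data if v < value]
    right = [c for v, c in data if v >= value]
    labels = [c for _, c in data]
    n = len(data)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

data = [(1.0, "Yes"), (1.0, "Yes"), (2.0, "No"), (3.0, "Yes"), (3.0, "No")]
print(round(split_gain(data, 2.0), 2))  # 0.29
print(round(split_gain(data, 3.0), 2))  # 0.01
```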

So why not just recursively apply that procedure, and split our dataset until we cannot achieve information gain by splitting further? The issue is that we might end up with an artificially fine-grained partition, over-fitting the data.

As an illustration, consider the following contrived example:

Continuous Feature Output Class
1.0 Yes
2.0 No
3.0 Yes
4.0 No

From a “human perspective”, the Continuous Feature looks fairly uninformative. However, if we apply our recursive split, we’ll end up doing something like this (hope the notation is understandable):

[ 1.0; 2.0; 3.0; 4.0 ] –> [ 1.0 ], [ 2.0; 3.0; 4.0 ]  –>  [ 1.0 ], [ 2.0 ], [ 3.0 ; 4.0 ] –> [ 1.0 ], [ 2.0 ], [ 3.0 ], [ 4.0 ].

At every step, extracting a single Example increases our information, and the final result has an obvious over-fitting problem, with each Example forming its own group.

To address this issue, we need a “compensating force”, to penalize the formation of blocks that are too small. For that purpose, the algorithm uses a criterion based on the Minimum Description Length principle (MDL). From what I gather, conceptually, the MDL principle “basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself” [source]. In this case, our model is pretty terrible, because to represent the data, we end up using all of the data itself.

This idea appears in the full algorithm as an additional condition: a split will be accepted only if the entropy gain is greater than a minimum level, given by the formula

gain >= (1/N) x log2(N - 1) + (1/N) x [ log2(3^k - 2) - (k x Entropy(S) - k1 x Entropy(S1) - k2 x Entropy(S2)) ]

where N is the number of examples in the block S to be split, k is the number of Classes represented in S, and k1, k2 are the numbers of Classes represented in the two sub-blocks S1 and S2.
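In code, that threshold is a one-liner. Here is a Python sketch (mirroring the post’s F# minGain below: entropies in nats, the penalty terms in base 2; `mdl_threshold` is my name). Plugging in the worked example – splitting the 5-example dataset on 2.0 – shows the threshold (about 0.95) is far above the gain of 0.29: on a sample this tiny, the criterion rejects even the “good” split.

```python
import math

def mdl_threshold(n, k, k1, k2, h, h1, h2):
    """Minimum entropy gain required to accept a split (Fayyad & Irani 1993).
    n: examples in the block; k, k1, k2: number of classes in the block
    and in its two halves; h, h1, h2: the corresponding entropies."""
    delta = math.log2(3 ** k - 2) - (k * h - k1 * h1 - k2 * h2)
    return math.log2(n - 1) / n + delta / n

# Worked example: splitting the 5-example dataset on 2.0
print(round(mdl_threshold(5, 2, 1, 2, 0.67, 0.0, 0.63), 2))  # 0.95
```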

The derivation of that stopping criterion is way beyond my level in information theory (look at the Fayyad and Irani article listed below if you are curious about the details, it’s pretty interesting), so I won’t make a fool of myself and attempt to explain it. At a very high-level, though, with heavy hand-waving, the formula appears to make some sense:

  • (1/N) x log2(N - 1) decreases to 0 as N goes to infinity; this introduces a penalty on splitting smaller datasets (to an extent),
  • (1/N) x [ log2(3^k - 2) - (k x Entropy(S) - k1 x Entropy(S1) - k2 x Entropy(S2)) ] favors splits which “cleanly” separate classes (if k1 or k2 is smaller than k, the penalty is reduced),
  • where the log2(3^k - 2) term comes from is not at all obvious to me.


Here is my naïve implementation of the algorithm (available here on GitHub):

namespace Discretization

// Recursive minimal entropy partitioning,
// based on Fayyad & Irani 1993. 
// See the Fayyad & Irani article, section 3.3,
// for a description of the algorithm.
// Note: this can certainly be optimized.
module MDL =

    open System
    // Logarithm of n in base b
    let logb n b = log n / log b

    let entropy (data: (_ * _) seq) =
        let N = data |> Seq.length |> float
        data
        |> Seq.countBy snd
        |> Seq.sumBy (fun (_, count) -> 
            let p = float count / N
            - p * log p)

    // A Block of data to be split, with its
    // relevant characteristics (size, number
    // of classes, entropy)
    type Block (data: (float * int) []) =
        let s = data |> Array.length |> (float)
        let classes = data |> Array.map snd |> Set.ofArray |> Set.count
        let k = classes |> (float)
        let h = entropy (data)
        member this.Data = data
        member this.Classes = classes
        member this.S = s
        member this.K = k
        member this.H = h

    // Entropy gained by splitting "original" block
    // into 2 blocks "left" and "right"
    let private entropyGain (original:Block) (left:Block) (right:Block) =
        original.H - 
        ((left.S / original.S) * left.H + (right.S / original.S) * right.H)

    // Minimum entropy gain required
    // for a split of the "original" block
    // into 2 blocks "left" and "right"
    let private minGain (original:Block) (left:Block) (right:Block) =
        let delta = 
            logb (pown 3. original.Classes - 2.) 2. - 
            (original.K * original.H - left.K * left.H - right.K * right.H)
        ((logb (original.S - 1.) 2.) / original.S) + (delta / original.S)

    // Identify the best acceptable value
    // to split a block of data
    let split (data:Block) =
        // Candidate values to use as split
        // We remove the smallest, because
        // by definition no value is smaller
        let candidates = 
            data.Data
            |> Seq.map fst
            |> Seq.distinct
            |> Seq.sort
            |> Seq.toList
            |> List.tail

        let walls = seq { 
            for value in candidates do
                // Split the data into 2 groups,
                // below/above the value
                let g1, g2 = 
                    data.Data
                    |> Array.partition (fun (v, c) -> v < value)

                let block1 = Block(g1)
                let block2 = Block(g2)
                let gain = entropyGain data block1 block2
                let threshold = minGain data block1 block2

                // if minimum threshold is met,
                // the value is an acceptable candidate
                if gain >= threshold 
                then yield (value, gain, block1, block2) }

        if Seq.isEmpty walls then None
        else
            // Return the split value that
            // yields the best entropy gain
            walls
            |> Seq.maxBy (fun (value, gain, b1, b2) -> gain) 
            |> Some

    // Top-down recursive partition of a data block,
    // accumulating the partitioning values into
    // a list of "walls" (splitting values)
    let partition (data:Block) = 
        let rec recursiveSplit (walls:float list) (data:Block) =
            match (split data) with
            | None -> walls // no split found
            | Some(value, gain, b1, b2) ->
                // append new split value
                let walls = value::walls
                // Search for new splits in first group
                let walls = recursiveSplit walls b1
                // Search for new splits in second group
                recursiveSplit walls b2
        // and go search!
        recursiveSplit [] data |> List.sort

The code appears to work, and is fairly readable / clean. I am not fully pleased with it, though. It’s a bit slow, and I have this nagging feeling that there is a much cleaner way to write that algorithm. I also dislike casting the counts to floats, but that’s the best way I found to avoid a proliferation of casts everywhere in the formulas, which operate mostly on floats (e.g. logs or proportions).

To avoid re-computing entropies, and counts of elements and classes, I introduced a Block class, which represents a block of data to be split – an array of (float * int), where the float is the continuous value and the int the label / index of the class. The algorithm recursively attempts to break blocks, and accumulates “walls” / split points along the way; it looks up every float value in the current block as a potential split point, generates a sequence of valid candidates, picks the one that generates the largest gain, and keeps searching in the two resulting blocks.
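For readers who just want the control flow in one place, here is a compact Python sketch of the same top-down procedure (same conventions as the F# above: entropies in nats, penalty terms in base 2; this is an illustration with my own names, not the post’s code), exercised on two of the synthetic cases used in the test script:

```python
import math

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log(labels.count(c) / n)
                for c in set(labels))

def partition(data):
    """Recursive minimal entropy partitioning of (value, class) pairs.
    Returns the sorted list of accepted split values ("walls")."""
    labels = [c for _, c in data]
    n, k, h = len(data), len(set(labels)), entropy(labels)
    best = None  # (gain, value, left, right)
    # candidate walls: every distinct value except the smallest
    for value in sorted(set(v for v, _ in data))[1:]:
        left = [(v, c) for v, c in data if v < value]
        right = [(v, c) for v, c in data if v >= value]
        l_lab, r_lab = [c for _, c in left], [c for _, c in right]
        h1, h2 = entropy(l_lab), entropy(r_lab)
        gain = h - (len(left) / n) * h1 - (len(right) / n) * h2
        # MDL stopping criterion (Fayyad & Irani 1993)
        k1, k2 = len(set(l_lab)), len(set(r_lab))
        delta = math.log2(3 ** k - 2) - (k * h - k1 * h1 - k2 * h2)
        threshold = math.log2(n - 1) / n + delta / n
        if gain >= threshold and (best is None or gain > best[0]):
            best = (gain, value, left, right)
    if best is None:
        return []
    _, value, left, right = best
    # keep the wall, and keep splitting each side recursively
    return sorted([value] + partition(left) + partition(right))

# degenerate alternating case: no split should be accepted
alternating = [(float(i), i % 2) for i in range(101)]
print(partition(alternating))  # []

# two cleanly separated classes: a single wall at 1.0
separated = ([(i / 100, 0) for i in range(100)]
             + [(1.0 + i / 100, 1) for i in range(100)])
print(partition(separated))  # [1.0]
```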


So… does it work? This is by no means a complete validation (see the References below for some more rigorous analysis), but I thought I would at least try it on some synthetic data (the test script is on GitHub as well):

#load "MDL.fs"
open System
open Discretization.MDL

let rng = System.Random()

let tests = [
    // one single class
    ("Single",
     [| for i in 1 .. 100 -> (rng.NextDouble(), 0) |]);
    // class 0 from 0 to 1, class 1 from 1 to 2
    ("Separate",
     [| for i in 1 .. 100 -> (rng.NextDouble(), 0)
        for i in 1 .. 100 -> (rng.NextDouble() + 1.0, 1) |]);
    // overlapping classes
    ("Mixture",
     [| for i in 1 .. 100 -> (rng.NextDouble(), rng.Next(0, 2))
        for i in 1 .. 100 -> (rng.NextDouble() + 0.5, rng.Next(1, 3))
        for i in 1 .. 100 -> (rng.NextDouble() + 1.0, rng.Next(2, 4))
        for i in 1 .. 100 -> (rng.NextDouble() + 1.5, rng.Next(3, 5)) |]);
    // alternating classes 0, 1, 0, 1, ...
    ("Alternating",
     [| for i in 0 .. 100 -> (float i, if i % 2 = 0 then 0 else 1) |]) ]

tests |> List.iter (fun (title, testData) ->
    printfn "Sample: %s" title
    let data = Block(testData)
    let result = partition data
    printfn "[ %s ]" (String.Join(", ", result)))

“tests” is a list of names + datasets, which we pass through the partition function.

  • “Single” is a trivial single class (we expect no partition),
  • “Separate” is a 2-class dataset, perfectly separated (0s are in 0 to 1, 1s in 1 to 2) (we expect a partition at 1)
  • “Mixture” is a 5-class dataset, with overlaps; we expect partitions at 0.5, 1, 1.5 and 2.
  • “Alternating” is the degenerate case we described earlier, with a sequence of 0, 1, 0, 1, … – we hope to get no partition.

Running this in FSI produces the following:

Sample: Single
[  ]
Sample: Separate
[ 1.00274506863334 ]
Sample: Mixture
[ 0.520861916486575, 1.00554223847834, 1.55028371561798, 1.97660811872995 ]
Sample: Alternating
[  ]

… looks like the algorithm is handling these obvious cases just the way it should.

That’s it for today. I’ll come back to the topic of discretization soon, this time looking at Khiops / Chi-Square based approaches. In the meanwhile, maybe this will come in handy for some of you – and let me know if you have comments or questions!


“Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning”: the original Fayyad and Irani article, with a derivation of the stopping criterion.

“Supervised and Unsupervised Discretization of Continuous Features”: a discussion and comparison of a few discretization approaches.

“Khiops: A Statistical Discretization Method of Continuous Attributes”: primarily focused on Chi-Square based approaches, with a comparison with the MDL model at the end.


5/29/2013 2:09:05 AM #

Jack Fox

Really elegant solution... the paper you cite goes on my "to read" stack. Looking forward to seeing the Chi-Square approach, and after that, your thoughts on how the two approaches do with your favorite real-world datasets. How important is algorithm efficiency when you are dealing with real data sets? Does either solution become too slow at some scale?


6/4/2013 4:45:13 AM #


Thanks! I have a soft spot for entropy-based approaches; there is something I find really fascinating there. And I am curious too about the comparison with Chi-Square, which uses a bottom-up approach instead of the top-down one here.

Mathias


6/3/2013 7:30:23 PM #


I wonder what the link between MDL and information bottleneck is.

nicolas

6/4/2013 4:52:31 AM #


Can you elaborate on what you mean by information bottleneck? My general understanding is that MDL measures how good the model is at representing the data, and how much information it takes to represent the model itself. I find the idea fascinating: it's measuring the cost and information gain together, in consistent units, which helps prevent over-fitting: if you try to fit the data too closely, the complexity (cost) of the model required to do so will offset the gain.
I hope I didn't 100% miss your point :)

Mathias

6/9/2013 6:59:13 PM #


I have the same understanding of it, and I think it is a fundamental notion, surprisingly in some way, just like 'computability' can be surprisingly fundamental. But it seemed that 'information bottleneck' is not far, in some way that I will try to see. I found this overview which seems interesting to put a few methods on the map.

nicolas


7/6/2013 3:27:27 AM #

Jack Fox

Should discretizing be performed against a sample set or the entire problem set? Overfitting is always to be avoided, but in the case of deriving optimal discrete categories does evaluating the entire set potentially reveal important information?


7/6/2013 9:04:32 AM #


Thanks for the comment! Per se, you can discretize whichever way you want. I think the trade-off here is between discretizing globally, with a larger sample and less risk of over-fitting, or locally, with more "sensitivity" but with the flip-side of the previous. Consider this situation: suppose (hypothetically) that age matters only for some small category (say, 3rd class male passengers). Looking at the entire dataset, you might not pick that up because it would be masked by the larger sample where the feature is irrelevant, whereas if you discretized at each node, you would actually pick up the partition in the relevant group(s). You could even construct artificial examples, like a feature which has an effect on one group, and the opposite on the other - you would miss that globally, and find it only locally. At the same time, the smaller the groups, the less reliable discretizing may become.

Mathias

