by mathias 16. January 2010 11:15

One of my recent posts looked into reading the VBA code attached to a workbook, and lead to a discussion on analyzing the differences between the macros of two workbooks – what is commonly called a “diff”.

This got me curious as to how diffs are generated. A quick search lead to the Longest Common Subsequence problem: once (one of) the longest common sub-sequence (abbreviated LCS from now on) of characters between two texts has been identified, it is straightforward to determine what has been added and removed from the original text to get the second text.

Example

Original: this is my great code
Modified: that is my awesome code

LCS: th is my  code

Changes to original: this is my great code

The idea behind the algorithm used to identify such a longest common subsequence (LCS) is a nice example of dynamic programming, and goes along these lines. If I have two sequences,

  • if they start with the same character (the head), then their LCS is the head + the LCS of the right-hand remainder of each string (the tail),
  • if they don’t start with the same character, then their LCS could start either with the head of the first sequence, or of the second one. Their LCS is the longest of the LCS of the first sequence and the tail of the second, and of the LCS of the second sequence and the tail of the first one.

This sounded like a good problem to try out my new F# “skills” – here is my first take on it:

let rec LCS list1 list2 =
  if List.length list1 = 0 || List.length list2 = 0 then
    List.Empty
  else
    let tail1 = List.tail list1
    let tail2 = List.tail list2
    if List.head list1 = List.head list2 then      
      List.head list1 :: LCS tail1 tail2
    else
      let candidate1 = LCS list1 tail2
      let candidate2 = LCS tail1 list2
      if List.length candidate1 > List.length candidate2 then
        candidate1
      else
        candidate2;;

A few comments. First, it took me under 15 minutes to write this. I am sure this is far from optimal; in particular I suspect that memory usage will be rather awful for larger sequences. Still, I usually struggle with recursions, and this one just flew – and the code is, in my opinion, very understandable, and very close to the human description of the algorithm.

Then, straight off the bat, I got a generic function – and I didn’t even try to. Give it a list of characters, it will work. Feed it a list of integers, it will work, too. OK, I’ll give you that one: feed it a string, and it will fail, because it expects a list, but that should be easy enough to address. (The one thing I need to figure out is how to validate that the two lists are of the same type, but I can live with that for now). I am starting to really dig F# type inference.

Comments

1/24/2010 10:57:14 AM #

Yannick

Here is how I would write it in OCaml, with a pattern-matching which takes profit from the ability of these languages to deconstruct types:

let rec lcs list1 list2 =
  match (list1,list2) with
      head1 :: tail1, head2 :: tail2 ->
        if head1 = head2 then
          head1 :: lcs tail1 tail2
        else
          let candidate1 = lcs list1 tail2 in
          let candidate2 = lcs tail1 list2 in
          if List.length candidate1 > List.length candidate2 then
            candidate1
          else
            candidate2
    | _ -> [];;

Why do you want to validate that the lists have the same type?
The type-system already figures out that they should be, and it will refuse to compile if you pass in a list of integers and a list of chars. Here is the type in OCaml:

val lcs : 'a list -> 'a list -> 'a list

Of course, the recursion might be costly on the stack, which is why we usually rewrite such functions using a local terminal-recursive function, but here you need 2 calls to lcs when the head elements don't match, so this simple solution does not apply. Instead, you need to append to the 1st call of the two a list of computations that remain to be done in a continuation-passing style. So that this call can be made terminal recursive, and so you avoid crashing the stack.

More code to write in F#, lucky man!

Yannick France | Reply

1/24/2010 11:12:41 AM #

Mathias

Aha, I have to give some thought to the terminal-recursive idea! I ran exactly into the issue you mention - I couldn't figure out how to use tail-recursion because of the two calls. But now that you mentioned this, I checked my F# book, and sure enough, the section after tail recursion is Continuations, which seems to describe what you are talking about... I need to digest that idea!
PS Is this really OCaml? I had never seen OCaml code before, it's amazingly similar to F#!

Mathias | Reply

1/26/2010 3:15:51 PM #

Yannick

Not surprising as F# was created as a spin-of of OCaml for the something# world.
I like that there is no need for this trailing "in" in F#, it is cleaner.
And F# does not seem to rely on the same strict separation of identifiers between capitalized/uncapitalized ones. (In OCaml, LCS would have to be a module or constructor name, and cannot be a function name, because it is capitalized.)
Mostly syntax you see!

Yannick France | Reply

Add comment


(Will show your Gravatar icon)

  Country flag

biuquote
  • Comment
  • Preview
Loading