by mathias 16. January 2010 11:15

One of my recent posts looked into reading the VBA code attached to a workbook, and lead to a discussion on analyzing the differences between the macros of two workbooks – what is commonly called a “diff”.

This got me curious as to how diffs are generated. A quick search lead to the Longest Common Subsequence problem: once (one of) the longest common sub-sequence (abbreviated LCS from now on) of characters between two texts has been identified, it is straightforward to determine what has been added and removed from the original text to get the second text.

Example

Original: this is my great code
Modified: that is my awesome code

LCS: th is my  code

Changes to original: this is my great code

The idea behind the algorithm used to identify such a longest common subsequence (LCS) is a nice example of dynamic programming, and goes along these lines. If I have two sequences,

  • if they start with the same character (the head), then their LCS is the head + the LCS of the right-hand remainder of each string (the tail),
  • if they don’t start with the same character, then their LCS could start either with the head of the first sequence, or of the second one. Their LCS is the longest of the LCS of the first sequence and the tail of the second, and of the LCS of the second sequence and the tail of the first one.

More...