Monday, August 29, 2005

PHP: Longest Common Substring

While working on my answer machine program, I came across a need to find recurring patterns in strings. What I wanted to do was take two strings and return the longest substring contained by both. After about 30 minutes worth of searching Google for a quick solution, I realized that this was not a trivial problem. It seems that finding the longest common substring is a classical problem in computer science. Helpful examples of LCS algorithms were very hard to come by, and there was virtually nothing available for PHP. Maybe when I earn a PhD I will try to impress people with useless formulas and diagrams, but for now I figured I would just write some realtively straightforward code that gets the job done.

The solution that seemed the simplest to me was to compare the strings character by character looking for matches that would signify the possible beginning of a common substring. I really don't consider one character substrings to be significant, so my solution only initiates a match when it finds two identical substrings two characters in length. It continues down each string char-by-char as long they match, and then tosses the completed substring into an array. Every common substring is obtained this way, and then we simply return the longest of these as the result. If more than one common substring is longest, the one that appears first in the first source string is returned.

See it in action and view the source.

After all that, I'm starting to think it might be more appropriate for my application to tokenize the strings and then capture the longest common sequence of tokens. In other words, break the string down into words and return the longest common phrase. It won't be that much more difficult to implement both methods in my answer machine, and compare their performance to see which one is better.

2 comments:

Unknown said...

Hello, could you send me your php code to find the longest common substring ?

Thanks in advance

Charlie said...

Archive.org to the rescue!

http://web.archive.org/web/20090823033402/http://artificialminds.net/labs/strlcs.txt