Thursday, November 24, 2011

Phonetic comparison of strings

I was working on data clean up to some schools database. Data can be in any language.
One of the thing that I have to as part of this, is to find possible duplicates. First we tried out soundex algorithm, which has build in sql server, it is a good algorithm but is most suitable names and surnames (as it was developed for use in census ). Next algorithm we tried is Levenshtien distance, this takes 2 strings and gives no of character different in both strings. For a table with 10k rows for checking one field for possible duplicates it was taking about 15 min, which is quite long. Levenshtien does not well for string with spaces.
Which make reduces its usefulness very much.
So I searched around for few better algorithms. I found a soundex derivative 'Daitch–Mokotoff Soundex' , but it is also for surname with Slavic and Germanic support. Another one I found is Metaphone, which is suitable for most English words, but it is somewhat suitable for english related language (like spanish), but for language like japanese,korean etc.. it doesn't work at all. For strings with spaces it is not much accurate.
There is an improved version of Metaphone named Double-Metaphone. On language wise it is same as Metaphone. For strings with spaces it is much better, it identifies both "bank of india" and "bankofindia" as
same, while non of other algorithm is able to find it. Here are the queries and there output 


select soundex('bank of india'),soundex('bankofindia')
-- B520, B521
select dbo.Levenshtein('bank of india','bankofindia')
--NULL
select dbo.Metaphone('bank of india'),dbo.Metaphone('bankofindia')
--bnk of int, bnkfnt
select dbo.DoubleMetaPhone('bank of india'),dbo.DoubleMetaPhone('bankofindia')
--PNKFNPNKFN,PNKFNPNKFN