Phonetic comparison of strings

I was working on data cleanup for a schools database. The data can be in any language.
One of the things I have to do as part of this is find possible duplicates. First we tried the Soundex algorithm, which is built into SQL Server. It is a good algorithm, but it is most suitable for names and surnames (it was developed for use in the census). The next algorithm we tried was Levenshtein distance, which takes two strings and gives the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one into the other. For a table with 10k rows, checking one field for possible duplicates took about 15 minutes, which is quite long. Levenshtein also does not do well on strings with spaces, because every missing or extra space counts as an edit, which reduces its usefulness very much.
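To make the Levenshtein part concrete, here is a minimal Python sketch of the distance calculation. It is not the dbo.Levenshtein function used in the queries further down (I am not reproducing that one here), just the textbook dynamic-programming version, and the function name is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in b so the working rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the first i-1 chars of a and first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i chars of a to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


# The two spaces are the only difference, and each one costs one edit.
print(levenshtein("bank of india", "bankofindia"))  # 2

# Comparing every row with every other row grows quadratically.
rows = 10_000
print(rows * (rows - 1) // 2)  # 49995000 pairs for a 10k-row table
```

The pair count at the end is probably why a 10k-row table takes so long: a plain all-pairs check needs close to 50 million distance calculations, and each one is itself proportional to the product of the two string lengths.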
So I searched around for a few better algorithms. I found a Soundex derivative, 'Daitch–Mokotoff Soundex', but it is also meant for surnames, with support for Slavic and Germanic names. Another one I found is Metaphone, which is suitable for most English words and somewhat suitable for languages related to English (like Spanish), but for languages like Japanese, Korean etc. it doesn't work at all. For strings with spaces it is not very accurate.
There is an improved version of Metaphone named Double Metaphone. Language-wise it is the same as Metaphone. For strings with spaces it is much better: it identifies both "bank of india" and "bankofindia" as the same, while none of the other algorithms is able to. Here are the queries and their output:


select soundex('bank of india'), soundex('bankofindia')
-- B520, B521
select dbo.Levenshtein('bank of india', 'bankofindia')
-- NULL
select dbo.Metaphone('bank of india'), dbo.Metaphone('bankofindia')
-- bnk of int, bnkfnt
select dbo.DoubleMetaPhone('bank of india'), dbo.DoubleMetaPhone('bankofindia')
-- PNKFNPNKFN,PNKFNPNKFN
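The Soundex result above (B520 versus B521) is easier to understand with a small sketch. The Python below is a simplified Soundex, not SQL Server's actual implementation. The stop_at_non_letter flag is my own assumption: the B520 output looks as if encoding stops at the first space, so the flag imitates that. The names soundex and stop_at_non_letter are only illustrative.

```python
# Consonant groups used by classic Soundex; vowels, H, W and Y get no digit.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(text: str, stop_at_non_letter: bool = True) -> str:
    """Simplified four-character Soundex code for text.

    stop_at_non_letter=True is an assumption that imitates the B520
    output above: encoding ends at the first character outside A-Z.
    With False, such characters are skipped instead.
    """
    letters = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            letters.append(ch)
        elif stop_at_non_letter:
            break
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        prev = digit  # a vowel resets prev, so a repeated digit counts again
    return code.ljust(4, "0")  # pad short codes with zeros


print(soundex("bank of india"))   # B520 (only "bank" gets encoded)
print(soundex("bankofindia"))     # B521 (the F of "of" adds the 1)
print(soundex("bank of india", stop_at_non_letter=False))  # B521
```

So under this assumption the two codes differ only because the space cuts the first string short. Skipping the spaces instead makes both inputs produce B521, which suggests that stripping spaces before calling Soundex may be worth trying when a field holds several words.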
