
Phonetic comparison of strings

I was working on data cleanup for a schools database. The data can be in any language.
One of the things I have to do as part of this is find possible duplicates. First we tried the Soundex algorithm, which is built into SQL Server. It is a good algorithm, but it is most suitable for names and surnames (it was developed for use in the census). The next algorithm we tried was Levenshtein distance, which takes two strings and gives the number of single-character edits (insertions, deletions, or substitutions) needed to turn one into the other. For a table with 10k rows, checking one field for possible duplicates took about 15 minutes, which is quite long. Levenshtein also does not do well on strings with spaces, which reduces its usefulness very much.
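To make the Levenshtein numbers concrete, here is a minimal Python sketch of the distance. It is not the dbo.Levenshtein function used in this post (its source is not shown), and the helper pair_count is only an illustration that assumes every row is compared with every other row once.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def pair_count(rows: int) -> int:
    """Hypothetical helper: number of unordered row pairs in a table."""
    return rows * (rows - 1) // 2


print(levenshtein("bank of india", "bankofindia"))  # 2
print(pair_count(10_000))                           # 49995000
```

The two spaces alone cost two edits, so a tight distance threshold can miss such a pair. Under the all-pairs assumption, 10k rows mean almost 50 million comparisons, which fits with the long run time.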
So I searched around for a few better algorithms. I found a Soundex derivative, 'Daitch–Mokotoff Soundex', but it is also meant for surnames, with added support for Slavic and Germanic names. Another one I found is Metaphone, which is suitable for most English words and somewhat suitable for English-related languages (like Spanish), but for languages like Japanese, Korean, etc. it doesn't work at all. For strings with spaces it is not very accurate.
There is an improved version of Metaphone named Double Metaphone. Language-wise it is the same as Metaphone. For strings with spaces it is much better: it identifies "bank of india" and "bankofindia" as the same, while none of the other algorithms is able to. Here are the queries and their output:


select soundex('bank of india'), soundex('bankofindia')
-- B520, B521
select dbo.Levenshtein('bank of india', 'bankofindia')
-- NULL
select dbo.Metaphone('bank of india'), dbo.Metaphone('bankofindia')
-- bnk of int, bnkfnt
select dbo.DoubleMetaPhone('bank of india'), dbo.DoubleMetaPhone('bankofindia')
-- PNKFNPNKFN, PNKFNPNKFN
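
The two Soundex codes in the first query can be reproduced with a small Python sketch of the standard American Soundex rules. It is only an illustration, not SQL Server's implementation, which may differ in details. The flag stop_at_non_letter is my assumption: stopping at the first space is consistent with the B520 shown above, because B520 is the code for "bank" alone.

```python
# Consonant groups of standard American Soundex; vowels, h, w and y have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(text: str, stop_at_non_letter: bool = True) -> str:
    """Return a four-character Soundex code for text."""
    text = text.lower()
    if stop_at_non_letter:
        # Assumption: encode only the leading run of letters.
        end = 0
        while end < len(text) and text[end].isalpha():
            end += 1
        text = text[:end]
    else:
        # Alternative: ignore spaces and punctuation entirely.
        text = "".join(ch for ch in text if ch.isalpha())
    if not text:
        return ""
    code = text[0].upper()
    previous = SOUNDEX_CODES.get(text[0], "")
    for ch in text[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets this, so a repeated code counts again
    return (code + "000")[:4]  # pad with zeros, keep four characters


print(soundex("bank of india"))         # B520
print(soundex("bankofindia"))           # B521
print(soundex("bank of india", False))  # B521
```

With spaces stripped first, both strings get the same code, so the mismatch comes from how the space is handled rather than from the phonetic coding itself.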
