
Phonetic comparison of strings

I was working on data cleanup for a schools database. The data can be in any language.
One of the things I had to do as part of this was to find possible duplicates. First we tried the Soundex algorithm, which is built into SQL Server. It is a good algorithm, but it is best suited to names and surnames (it was developed for use in the census). The next algorithm we tried was Levenshtein distance, which takes two strings and gives the minimum number of single-character edits (insertions, deletions or substitutions) needed to turn one into the other. For a table with 10k rows, checking one field for possible duplicates took about 15 minutes, which is quite long. Levenshtein also does not do well on strings with spaces,
which reduces its usefulness considerably.
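To make the edit-distance idea concrete, here is a minimal Python sketch of Levenshtein distance. It is only an illustration of the textbook algorithm, not the dbo.Levenshtein function used in the queries in this post (its implementation is not shown here), and the function name is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("bank of india", "bankofindia"))  # 2: the two spaces
```

Each call costs time proportional to the product of the two string lengths. If every row of a 10k-row table is compared with every other row, that is roughly 50 million calls, which may be part of why the run was so slow.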
So I searched around for a few better algorithms. I found a Soundex derivative, 'Daitch–Mokotoff Soundex', but it is also meant for surnames, with added support for Slavic and Germanic names. Another one I found is Metaphone, which is suitable for most English words and somewhat suitable for languages close to English (like Spanish), but for languages like Japanese or Korean it doesn't work at all. For strings with spaces it is not very accurate.
There is an improved version of Metaphone named Double Metaphone. Language-wise it is the same as Metaphone. For strings with spaces it is much better: it identifies "bank of india" and "bankofindia" as the same, while none of the other algorithms is able to. Here are the queries and their output:


-- soundex is built into SQL Server; the dbo.* functions are user-defined
select soundex('bank of india'), soundex('bankofindia')
-- B520, B521
select dbo.Levenshtein('bank of india', 'bankofindia')
-- NULL
select dbo.Metaphone('bank of india'), dbo.Metaphone('bankofindia')
-- bnk of int, bnkfnt
select dbo.DoubleMetaPhone('bank of india'), dbo.DoubleMetaPhone('bankofindia')
-- PNKFNPNKFN, PNKFNPNKFN
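
The B520 / B521 pair from the first query is what a textbook American Soundex produces if encoding stops at the first non-letter. The Python sketch below makes that assumption explicit. It is a rough illustration, not SQL Server's implementation, and the stop-at-space rule is my guess based only on the output shown above.

```python
# Letter-to-digit table of classic American Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(text: str) -> str:
    """Four-character Soundex code of the leading run of letters in text."""
    letters = []
    for ch in text.lower():
        if not ch.isalpha():
            break  # ASSUMPTION: encoding stops at the first non-letter
        letters.append(ch)
    if not letters:
        return ""
    code = letters[0].upper()
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != last:
            code += digit  # new consonant group
        if ch not in "hw":
            last = digit   # vowels reset the group, h and w do not
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros


print(soundex("bank of india"), soundex("bankofindia"))  # B520 B521
```

Under this assumption only "bank" is encoded for the spaced string, so the two spellings get different codes even though they sound the same.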
