machine learning - Best way to clean and normalise large amount of data relying on string matching algorithm



I am currently working on a data modelling project as part of my university summer project. Client data needs a lot of cleaning, since a number of columns rely on human input and have free text.

To give an example, the column Business Name has multiple entries for the same company. For "Hugo Boss" this includes "Hugo Bos", "Huggo Boss", "Hugo Boss Ltd".

I could potentially go through each row and identify all the values that have been used and create a map for each entry, however considering I am dealing with 1 million records this is very time consuming and not very ideal.

Do people know of a source code of such/ similar implementation? I've looked into matching algorithm, however they rely on an pre-computed pattern. What other matching algorithm or machine learning techniques can I use to develop an automated process that would clean the data i.e. match all the different names to a single name.

Any help would be appreciated.

4 Answers: 

This research field is called "Data Matching" or "Record Linkage".

There is a very good survey book of techniques you can use by Peter Christen. He also goes deep down into machine learning models and how to improve them from the basic approach like simple string distances (as other answers already suggested).

To give you a head start, you can try to compute character n-grams of your titles.

For n = 3 and Hugo Boss, you would get

[hug, ugo, go , o b,  bo, bos, oss]

Now you can compute the jaccard similarity between two sets of these ngrams.

Here, for example between Hugo Boss and Huggo Boss:

[hug, ugo, go , o b,  bo, bos, oss]
[hug, ugg, ggo, go , o b,  bo, bos, oss]
jaccard similarity: 0.6666666666666666

If you don't want to go down the route of implementing all of these things yourself, use Lucene. It's also very fast and scales well to billions of documents.


"Hugo Boss" this includes "Hugo Bos", "Huggo Boss", "Hugo Boss Ltd".... All of the above will have the same soundex (phonetic algorithm) values except for the last one with "LTD".

You could match soundex to the business names. This should work on "Hugo Boss" "Hugo Bos", and "Huggo Boss". However "Hugo Boss Ltd" will not match the other because of the LTD at the end. This technique has worked well for fuzzy matching where I work and the results have been useful when comparing across first names and last names to establish identity.

Keep in mind though that soundex won't work for things like social security numbers. It has a stricter domain as compared to a distance measure such as an edit distance.

You probably could also strip off things like "Ltd", "LLC", "Corp" that are common to business names in your data set. This would help a soundex matching framework because it shortens string lengths.

In addition you could compare letter ngrams as thomas recommended in his record linkage answer and this would simplify the number of ngrams to test as well.

Here is the NYSIIS algorithm:

The algorithm, as described in New York State Identification and Intelligence System:

1. Translate first characters of name: MAC → MCC, KN → N, K → C, PH, PF → FF, SCH → SSS
2. Translate last characters of name: EE → Y, IE → Y, DT, RT, RD, NT, ND → D
3. First character of key = first character of name.
4. Translate remaining characters by following rules, incrementing by one character each time:
    1. EV → AF else A, E, I, O, U → A
    2. Q → G, Z → S, M → N
    3. KN → N else K → C
    4. SCH → SSS, PH → FF
    5. H → If previous or next is non-vowel, previous.
    6. W → If previous is vowel, A.
    7. Add current to key if current is not same as the last key character.
5. If last character is S, remove it.
6. If last characters are AY, replace with Y.
7. If last character is A, remove it.
8. Append translated key to value from step 3 (removed first character)
9. If longer than 6 characters, truncate to first 6 characters. (only needed for true NYSIIS, some versions use the full key)

Soundex packages are found in many high level programming languages. In python you can try fuzzy package:

#!/usr/bin/env python

import fuzzy

names = [ 'Catherine', 'Katherine', 'Katarina',
      'Johnathan', 'Jonathan', 'John',
      'Teresa', 'Theresa',
      'Smith', 'Smyth',

for n in names:
    print '%-10s' % n, fuzzy.nysiis(n)


$ python
Catherine  CATARAN
Katherine  CATARAN
Katarina   CATARAN
Johnathan  JANATAN
Jonathan   JANATAN
John       JAN
Teresa     TARAS
Theresa    TARAS
Smith      SNATH
Smyth      SNATH
Jessica    JASAC
Joshua     JAS

The example above can be found here:

You can key and match ngrams or the full names.

Finally you can use the mode name in the data or some other method to normalize the name field.


Another option is to look at Levenshtein distance

That will help you with such cases as Hugo Boss vs Huggo Boss but won't work for Hugo Boss vs The Hugo Boss


You can try aho-corasick finite state machine and augment it with a wildcard. Another option is a suffix tree, i.e. a levensthein distance. You can try my PHP implementation of aho-corasick @ .