Calculating levenshtein distance within a list Python


Keywords:python 


Question: 

I have a list of strings and I want to filter out the strings that are too similar based on levenstein distance. So if lev(list[0], list[10]) < 50; then del list[10]. Is there any way I can calculate such distance between every pair of strings in the list, more efficiently?? Thanks!!

data2= []
for i in data:
    for index, j in enumerate(data):
        s = levenshtein(i, j)
        if s < 50:
            del data[index]
    data2.append(i)

The rather dumb code above is taking too long to compute...


1 Answer: 

What if we kept only the indexes of the hit-strings and just skipped them later? I ignore how much enumerate() and del() weigh and what the percentage of hits is (i.e. how many strings must be removed from your dataset).

THRESHOLD = 50
data = ["hel", "how", "are", "you"] # replace with your dataset

tbr = {} # holds the index of the strings to be removed
idx = 0
for i in data:
    for j in xrange(len(data)):
        if j != idx and levenshtein(i, data[j]) < THRESHOLD:
            tbr[j] = True
    idx += 1

# print tbr
data2 = []
idx = -1
for d in data:
    idx += 1
    if idx in tbr:
        continue # skip this string
    data2.append(d)
# print data2