﻿ Calculating levenshtein distance within a list Python - 开发者网 - DeveloperSite.cn

# Calculating levenshtein distance within a list Python

Keywords：python

Question:

I have a list of strings and I want to filter out the strings that are too similar based on levenstein distance. So if `lev(list[0], list[10]) < 50`; then `del list[10]`. Is there any way I can calculate such distance between every pair of strings in the list, more efficiently?? Thanks!!

``````data2= []
for i in data:
for index, j in enumerate(data):
s = levenshtein(i, j)
if s < 50:
del data[index]
data2.append(i)
``````

The rather dumb code above is taking too long to compute...

What if we kept only the indexes of the hit-strings and just skipped them later? I ignore how much `enumerate() and del()` weigh and what the percentage of hits is (i.e. how many strings must be removed from your dataset).

``````THRESHOLD = 50
data = ["hel", "how", "are", "you"] # replace with your dataset

tbr = {} # holds the index of the strings to be removed
idx = 0
for i in data:
for j in xrange(len(data)):
if j != idx and levenshtein(i, data[j]) < THRESHOLD:
tbr[j] = True
idx += 1

# print tbr
data2 = []
idx = -1
for d in data:
idx += 1
if idx in tbr:
continue # skip this string
data2.append(d)
# print data2
``````