Group a list of strings by Levenshtein in Python


Keywords:python 


Question: 

I created a script to calculate the Levenshtein distance of two strings. Now I want to group a list of strings in base of the Levenshtein distance. (if strings have a distance under a threshold, they will be in the same groub):

At the moment, I have done something but doesn't seems to work. Here is a pseudo-code:

for every string in list:
    create a new cluster with this string
    remove the string from the list
    for every string in the remaining list:
        if distance(string1,string2) < threshold:
             add string2 to the cluster
             remove string2 from the list

Here is the real code since a few users asked it:

cid = 0
clusters = {}

numb = range(len(mylist))
for i in numb:
        cls = [mylist[i]]
        numb.remove(i)
        for j in numb:
            if distance(mylist[i],mylist[j]) <= threshold:
                cls.append(mylist[j])
                numb.remove(j)

        clusters[cid] = cls      
        cid+=1
        cls = []

2 Answers: 

You are removing items from the list you are iterating over. That will change the indexes and cause unexpected behavior you should copy the values you need to a new list rather than modifying the list you are iterating on.



Here is some cleaned up code:

clusters = defaultdict(list)

numb = range(len(mylist))
for i in numb:
        for j in range(i+1, len(numb)):
            if distance(mylist[i],mylist[j]) <= threshold:
                clusters[i].append(mylist[j])
                clusters[j].append(mylist[i])