Group a list of strings by Levenshtein in Python



I created a script to calculate the Levenshtein distance of two strings. Now I want to group a list of strings in base of the Levenshtein distance. (if strings have a distance under a threshold, they will be in the same groub):

At the moment, I have done something but doesn't seems to work. Here is a pseudo-code:

for every string in list:
    create a new cluster with this string
    remove the string from the list
    for every string in the remaining list:
        if distance(string1,string2) < threshold:
             add string2 to the cluster
             remove string2 from the list

Here is the real code since a few users asked it:

cid = 0
clusters = {}

numb = range(len(mylist))
for i in numb:
        cls = [mylist[i]]
        for j in numb:
            if distance(mylist[i],mylist[j]) <= threshold:

        clusters[cid] = cls      
        cls = []

2 Answers: 

You are removing items from the list you are iterating over. That will change the indexes and cause unexpected behavior you should copy the values you need to a new list rather than modifying the list you are iterating on.

Here is some cleaned up code:

clusters = defaultdict(list)

numb = range(len(mylist))
for i in numb:
        for j in range(i+1, len(numb)):
            if distance(mylist[i],mylist[j]) <= threshold: