tf idf - Python: How to calculate tf-idf for a large data set


Keywords: python 2.7


Question: 

I have the following data frame df, which I converted from an SFrame:

   URI                                            name           text
0  <http://dbpedia.org/resource/Digby_M...        Digby Morrell  digby morrell born 10 october 1979 i...
1  <http://dbpedia.org/resource/Alfred_...       Alfred J. Lewy  alfred j lewy aka sandy lewy graduat...
2  <http://dbpedia.org/resource/Harpdog...        Harpdog Brown  harpdog brown is a singer and harmon...
3  <http://dbpedia.org/resource/Franz_R...  Franz Rottensteiner  franz rottensteiner born in waidmann...
4  <http://dbpedia.org/resource/G-Enka>                  G-Enka  henry krvits born 30 december 1974 i...

I have done the following:

from __future__ import division  # on Python 2.7, make / perform true division

from textblob import TextBlob as tb

import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = [tb(text) for text in df['text']]

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

But this takes a very long time, as there are 59,000 documents.

Is there a better way to do this?
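For reference, the main cost in the code above is that n_containing rescans every document for every word of every document. The same scoring can be done with the document frequencies precomputed once, in a single pass over the corpus. This is a pure-Python sketch that mirrors the tf and idf formulas from the question; top_tfidf and its whitespace tokenization are illustrative choices, not part of the original code:

```python
from __future__ import division  # true division on Python 2.7 as well

import math
from collections import Counter

def top_tfidf(docs, k=3):
    """Return the top-k (word, tf-idf) pairs for each document.

    Document frequencies are counted once up front, so the total cost
    is proportional to the corpus size rather than quadratic in it.
    """
    tokenized = [doc.split() for doc in docs]  # assumes whitespace tokens
    n_docs = len(tokenized)

    # df[w] = number of documents containing w (set() avoids double-counting)
    df = Counter(word for words in tokenized for word in set(words))

    results = []
    for words in tokenized:
        counts = Counter(words)
        n = len(words)
        # same formulas as the question: tf = count/len, idf = log(N / (1 + df))
        scores = {w: (c / n) * math.log(n_docs / (1 + df[w]))
                  for w, c in counts.items()}
        results.append(sorted(scores.items(),
                              key=lambda x: x[1], reverse=True)[:k])
    return results
```

With this, each document's top words come from a dictionary lookup into df instead of a fresh scan over all 59,000 documents.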


1 Answer: 

  • I am not sure about the best approach here, but I found a few solutions on the internet that use Spark, which you could look into.

  • On the other hand, I tried the following method and the results were not bad. You may want to try it:

    • I have a word list containing each word and its count.
    • I computed the average of those word counts.
    • I chose a lower and an upper limit around that average
      (e.g. lower bound = average / 2 and upper bound = average * 5).
    • Then I created a new word list keeping only the words whose counts fall between those bounds.
  • With this I got the following results:
    Before normalization, word vector length: 11880
    Mean: 19, lower bound: 9, upper bound: 95
    After normalization, word vector length: 1595

  • The cosine-similarity results were also better.
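The filtering steps above can be sketched as follows. filter_vocabulary and the factor parameters are illustrative names; the default bounds mirror the answer's example of average / 2 and average * 5:

```python
from __future__ import division  # true division on Python 2.7 as well

def filter_vocabulary(word_counts, low_factor=0.5, high_factor=5.0):
    """Keep only words whose count lies within bounds derived from the mean.

    word_counts: dict mapping word -> count (the "word list" above).
    lower bound = mean * low_factor, upper bound = mean * high_factor.
    """
    mean = sum(word_counts.values()) / len(word_counts)
    lower, upper = mean * low_factor, mean * high_factor
    return {w: c for w, c in word_counts.items() if lower <= c <= upper}
```

This drops both very rare words (mostly noise) and very frequent ones (carrying little discriminative weight), which is why the vector length shrinks from 11880 to 1595 in the numbers reported above.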