
# tf idf - Python: How to calculate tf-idf for a large data set

Keywords: python, 2.7

## Question

I have a following data frame `df`, which I converted from `sframe`

```
   URI                                            name           text
0  <http://dbpedia.org/resource/Digby_M...        Digby Morrell  digby morrell born 10 october 1979 i...
1  <http://dbpedia.org/resource/Alfred_...       Alfred J. Lewy  alfred j lewy aka sandy lewy graduat...
2  <http://dbpedia.org/resource/Harpdog...        Harpdog Brown  harpdog brown is a singer and harmon...
3  <http://dbpedia.org/resource/Franz_R...  Franz Rottensteiner  franz rottensteiner born in waidmann...
4  <http://dbpedia.org/resource/G-Enka>                  G-Enka  henry krvits born 30 december 1974 i...
```

I have done the following:

```
from __future__ import division  # true division under Python 2.7
from textblob import TextBlob as tb
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
```
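Most of the time in the loop above goes into `n_containing`, which rescans the whole `bloblist` for every word of every document, so the cost grows roughly quadratically with the number of documents. One way to speed this up while keeping the same formulas is to precompute the document frequency of each word in a single pass. A sketch, where the three-document `docs` corpus is a hypothetical stand-in for the real text column:

```python
import math
from collections import Counter

# Hypothetical small corpus standing in for the 59000-document text column.
docs = [
    "digby morrell born 10 october 1979".split(),
    "alfred j lewy aka sandy lewy".split(),
    "harpdog brown is a singer and harmonica player".split(),
]
n_docs = len(docs)

# One pass over the corpus: count each word at most once per document.
doc_freq = Counter(word for doc in docs for word in set(doc))

# Same idf formula as above: log(N / (1 + document frequency)).
idf = {word: math.log(n_docs / (1.0 + df)) for word, df in doc_freq.items()}

for i, doc in enumerate(docs):
    counts = Counter(doc)
    # tf is count / document length, exactly as in the question's tf().
    scores = {w: (c / float(len(doc))) * idf[w] for w, c in counts.items()}
    top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:3]
    print("Top words in document {}: {}".format(i + 1, top))
```

Each document is now scored from its own `Counter` and a dictionary lookup, so the per-document work no longer depends on the corpus size.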

But this is taking a lot of time, as there are `59000` documents.

Is there a better way to do it?

## 1 Answer

I was confused about this subject too, but I found a few solutions on the internet that use Spark. Here you can look at:

On the other hand, I tried the following method myself and the results were not bad. You may want to try it:

- I have a word list. This list contains each word and its count.
- I computed the average of these word counts.
- I chose a lower limit and an upper limit based on the average value
  (e.g. lower bound = average / 2 and upper bound = average * 5).
- Then I created a new word list using the upper and lower bounds.
- With these I got the following results:
  - Before normalization, word vector length: 11880
  - Mean: 19, lower bound: 9, upper bound: 95
  - After normalization, word vector length: 1595

- The cosine similarity results were also better.
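The pruning steps above can be sketched as follows; the `word_counts` values are hypothetical stand-ins, and the `/ 2` and `* 5` factors are the example bounds from the answer:

```python
from collections import Counter

# Hypothetical word counts standing in for the answer's word list.
word_counts = Counter({
    "the": 120, "of": 95, "born": 30, "player": 14,
    "singer": 9, "harmonica": 4, "aardvark": 1,
})

mean = sum(word_counts.values()) / float(len(word_counts))
lower = mean / 2   # e.g. lower bound = average / 2
upper = mean * 5   # e.g. upper bound = average * 5

# New word list: keep only words whose counts fall inside the bounds.
pruned = {w: c for w, c in word_counts.items() if lower <= c <= upper}

print("Before pruning: {} words".format(len(word_counts)))
print("Mean: {:.1f}, lower bound: {:.1f}, upper bound: {:.1f}".format(mean, lower, upper))
print("After pruning: {} words".format(len(pruned)))
```

Shrinking the vocabulary this way cuts the per-document vector length before any tf-idf or cosine-similarity computation, which is where the speedup comes from.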