machine learning - What are the different strategies for detecting noisy data in a pile of text?


Keywords:machine  learning 


Question: 

I have around 10 GB of text from which I extract features based on bag of words model. The problem is that the feature space is very high dimensional(1 million words) and I can not discard words based on the count of each word as both the most and least occurring words are important of the model to perform better. What are the different strategies for reducing the size of the training data and number of features while still maintaining/improving the model performance?
Edit: I want to reduce the size of the training data both because of overfitting and training time. I am using FastRank(Boosted trees) as my ML model. My machine has a core i5 processor running with 8GB RAM. The number of training instances are of the order of 700-800 million. Along with processing it takes more than an hour for the model to train. I currently do random sampling of the training and test data so as to reduce the size to 700MB or so, so that the training of the model finishes in minutes.


1 Answer: 

I'm not totally sure if this will help you because I dont know what your study is about, but if there is a logical way to divide up the 10Gigs of Text, (into documents or paragraphs) perhaps, you can try tf-idf.

This will allow you to discard words that appear very often across all partitions, and usually(the understanding is) that they dont contribute significant value to the overall document/paragraph etc.

And if your only requirement is to keep the most and least frequent words - would a standard distribution of the word frequencies help? Get rid of the average and 1 standard deviation(or whatever number you see fit).