r - Bigram analysis and term-document matrix


Keywords:r 


Question: 

I am doing a bigram analysis on my text corpus. My feature vector is a predefined set of bigram and unigram tokens.

Feature vector = ( good location, bad experience, clean, unfriendly, tidy, excellent, beautiful place)

My text: location is good but unfriendly staff.

Cleaned text: location good unfriendly staff.

I created a term-document matrix using the above dictionary and the cleaned text, but the "location good" bigram does not produce a "1". When I changed the cleaned text to "good location unfriendly staff", it did. In a bigram analysis, does the order of the words matter, and why? Or am I messing up the code? Kindly clarify.

"bad experience"  "tidy"  "clean"  "good location"  "excellent"  "beautiful place"  "unfriendly"
0                 0       0        0                0            0                  1             -- location good but unfriendly staff.
0                 0       0        1                0            0                  1             -- good location but unfriendly staff.


1 Answer: 

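In my experience, the order of words in an n-gram is critical: an n-gram is an ordered sequence of adjacent tokens, so "location good" and "good location" are different bigrams, and only the latter is in your dictionary. You would not want to treat the n-grams "Putin attacked" and "attacked Putin" as the same, as they have very different meanings.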
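
To make the order dependence concrete, here is a minimal sketch in plain Python rather than R. It is not the asker's code and does not reproduce any particular R package; the helper names `ngrams` and `dictionary_row` are hypothetical, and the tokenization (lowercase, drop periods, split on whitespace) is an assumption chosen to match the cleaned text in the question.

```python
def ngrams(tokens, n):
    """Return the ordered, contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def dictionary_row(text, dictionary):
    """Count how often each dictionary term occurs in the text.

    The text is expanded into its unigrams and bigrams, and each
    dictionary entry is matched against them exactly, word order included.
    """
    # Assumed cleaning: lowercase, strip periods, split on whitespace.
    tokens = text.lower().replace(".", "").split()
    terms = ngrams(tokens, 1) + ngrams(tokens, 2)
    return [terms.count(term) for term in dictionary]


# Dictionary from the question, in the column order of its table.
dictionary = ["bad experience", "tidy", "clean", "good location",
              "excellent", "beautiful place", "unfriendly"]

row_a = dictionary_row("location good unfriendly staff.", dictionary)
row_b = dictionary_row("good location unfriendly staff.", dictionary)

# Bigrams of the first text: "good location" is not among them.
print(ngrams("location good unfriendly staff".split(), 2))
# ['location good', 'good unfriendly', 'unfriendly staff']

print(row_a)  # [0, 0, 0, 0, 0, 0, 1]
print(row_b)  # [0, 0, 0, 1, 0, 0, 1]
```

The two rows match the ones in the question: the first text only ever yields the bigram "location good", so the "good location" column stays at 0. If order really should not matter for your features, one option would be to sort the words inside each bigram on both the dictionary side and the text side before matching, but that is a deliberate modeling choice rather than what a bigram normally means.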

So no, you are not messing up the code. You may just want to do a little more research into n-gram models. A good place to start is Chapter 4 of Speech and Language Processing by Jurafsky and Martin.