I am performing some sentiment analysis.
I need to count the vocabulary (distinct words) within a text.
The ngrams UDF seems to do a great job of determining the unigrams. I want to know which separators it uses to determine the unigrams (tokens). This is important because I want to mimic the vocabulary count using the split UDF instead. For example, given the following text (a product review):
I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.
The ngrams UDF counts 82 unigrams (tokens):
SELECT count(*) FROM (SELECT explode(ngrams(sentences(upper(reviewtext)), 1, 9999999)) FROM amazon.Food_review_part_small WHERE asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR') t;
82
However, using the split UDF with whitespace, comma, period, hyphen and double quotation mark as separators, there are 85 unigrams (tokens):
SELECT count(DISTINCT te) FROM amazon.Food_review_part_small LATERAL VIEW explode(split(upper(reviewtext), '[\\s,.-]|\"')) t AS te WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR';
85
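To narrow down where the two counts diverge, a query along the following lines should list the tokens that only one of the two approaches produces. This is an untested sketch: it assumes a Hive version that supports the WITH clause (0.13 or later), and the names review, ngram_tokens, split_tokens and tok are made up for illustration.

```sql
-- Hypothetical diagnostic: tokens produced by only one of the two approaches.
WITH review AS (
  -- The single review under test, uppercased as in both queries above.
  SELECT upper(reviewtext) AS txt
  FROM amazon.Food_review_part_small
  WHERE asin = 'B0000CNU1X' AND reviewerid = 'A1UCAVBNJUZMPR'
),
ngram_tokens AS (
  -- ngrams() returns structs with an ngram array; a unigram is element 0.
  SELECT g.ngram[0] AS tok
  FROM (SELECT explode(ngrams(sentences(txt), 1, 9999999)) AS g FROM review) e
),
split_tokens AS (
  -- Same separator pattern as the split-based count.
  SELECT DISTINCT te AS tok
  FROM review LATERAL VIEW explode(split(txt, '[\\s,.-]|\"')) t AS te
  WHERE te <> ''
)
SELECT s.tok AS split_only, n.tok AS ngram_only
FROM split_tokens s
FULL OUTER JOIN ngram_tokens n ON s.tok = n.tok
WHERE s.tok IS NULL OR n.tok IS NULL;
```

Each returned row has a value in exactly one column, so the split_only values are tokens that ngrams does not emit in that form, and the ngram_only values are the reverse.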
There is little to no documentation that I can find. Does anyone know which separators the ngrams UDF uses to determine unigram tokens?